<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Enhancing Software Quality in Students' Programs</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Second Workshop on Software Quality Analysis, Monitoring, Improvement and Applications SQAMIA 2013</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Harri Keto (Tampere Univ. of Technology, Finland)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vladimir Kurbalija (Univ. of Novi Sad, Serbia)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastas Mishev (Univ. of Ss. Cyril and Methodius, Skopje, FYR Macedonia)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanjay Misra (Atilim Univ., Ankara, Turkey)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vili Podgorelec (Univ. of Maribor, Slovenia)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zoltan Porkolab (Eotvos Lorand Univ., Budapest, Hungary)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>(Bratislava, Slovakia)</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Informatics Faculty of Sciences, University of Novi Sad</institution>
          ,
          <country country="RS">Serbia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Novi Sad, Faculty of Sciences, Department of Mathematics and Informatics, Trg Dositeja Obradovića 4</institution>
          ,
          <addr-line>21000 Novi Sad</addr-line>
          ,
          <country country="RS">Serbia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>2</volume>
      <issue>13</issue>
      <fpage>15</fpage>
      <lpage>17</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>SQAMIA 2013</title>
      <p>Proceedings ISBN: 978-86-7031-269-2</p>
      <sec id="sec-2-1">
        <title>Preface</title>
        <p>This volume contains papers presented at the Second Workshop on Software Quality Analysis, Monitoring, Improvement, and Applications (SQAMIA 2013). SQAMIA 2013 was held during September 15-17, 2013, at the Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, Novi Sad, Serbia.</p>
        <p>SQAMIA 2013 is a continuation of the successful event held in 2012. The previous workshop, the first one, was organized within the 5th Balkan Conference in Informatics (BCI 2012) in Novi Sad. In 2013 SQAMIA became a standalone event, with the intention of becoming a traditional meeting of scientists and practitioners in the field of software quality.</p>
        <p>The main objective of the SQAMIA workshop series is to provide a forum for the presentation, discussion, and dissemination of scientific findings in the area of software quality, and to promote and improve interaction and cooperation between scientists and young researchers from the region and beyond.</p>
        <p>The SQAMIA 2013 workshop consisted of regular sessions with technical contributions reviewed and selected by an international program committee, as well as invited talks presented by leading scientists in the research areas of the workshop.</p>
        <p>SQAMIA workshops solicited submissions dealing with four aspects of software quality: quality analysis, monitoring, improvement, and applications. Position papers, papers describing work in progress, tool demonstration papers, technical reports, and other papers that would provoke discussion were especially welcome.</p>
        <p>In total, 13 papers were accepted and published in this proceedings volume. All published papers were double reviewed, and some papers received the attention of more than two reviewers. We would like to use this opportunity to thank all PC members and the external reviewers for submitting careful and timely opinions on the papers.</p>
        <p>We also gratefully acknowledge the program co-chairs, Tihana Galinac Grbac (Croatia), Marjan Heričko (Slovenia), Zoltan Horvath (Hungary), Mirjana Ivanović (Serbia), and Hannu Jaakkola (Finland), for helping to greatly improve the quality of the workshop.</p>
        <p>We extend special thanks to the SQAMIA 2013 Organizing Committee from the Department of Mathematics and Informatics, Faculty of Sciences, especially to its chair Gordana Rakić, for her hard work, diligence, and dedication to making this workshop the best it can be.</p>
        <p>Finally, we thank our sponsors, the Provincial Secretariat for Science and Technological Development, the Serbian Ministry of Education, Science and Technological Development, and the Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, for supporting the organization of this event.</p>
        <p>And last, but not least, we thank all the participants of SQAMIA 2013 for having made all the work that went into SQAMIA 2013 worthwhile.</p>
        <p>September 2013
Zoran Budimac</p>
      </sec>
      <sec id="sec-2-2">
        <title>Workshop Organization</title>
        <p>General Chair
Zoran Budimac (Univ. of Novi Sad, Serbia)
Program Chair
Zoran Budimac (Univ. of Novi Sad, Serbia)
Program Co-Chairs
Tihana Galinac Grbac (Univ. of Rijeka, Croatia)
Marjan Heričko (Univ. of Maribor, Slovenia)
Zoltan Horvath (Eotvos Lorand Univ., Budapest, Hungary)
Mirjana Ivanović (Univ. of Novi Sad, Serbia)
Hannu Jaakkola (Tampere Univ. of Technology, Pori, Finland)
Program Committee
Additional Reviewers
Roland Király (Eotvos Lorand Univ., Budapest, Hungary)
Miloš Radovanović (Univ. of Novi Sad, Serbia)
Organizing Committee (Univ. of Novi Sad, Serbia)</p>
        <sec id="sec-2-2-1">
          <title>Gordana Rakić, Chair</title>
        </sec>
        <sec id="sec-2-2-2">
          <title>Zoran Putnik</title>
        </sec>
        <sec id="sec-2-2-3">
          <title>Miloš Savić</title>
        </sec>
        <sec id="sec-2-2-4">
          <title>Organizing Institution</title>
          <p>Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, Serbia</p>
        </sec>
        <sec id="sec-2-2-5">
          <title>Sponsoring Institutions of SQAMIA 2013</title>
          <p>SQAMIA 2013 was partially financially supported by:
Provincial Secretariat for Science and Technological Development, Autonomous Province of Vojvodina, Republic of Serbia;
Ministry of Education, Science and Technological Development, Republic of Serbia;
Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, Serbia.</p>
        </sec>
        <sec id="sec-2-2-11">
          <title>Stability of Software Defect Prediction in Relation to Levels of Data Imbalance</title>
          <p>Stability of Software Defect Prediction in Relation to Levels of Data Imbalance
TIHANA GALINAC GRBAC AND GORAN MAUŠA, University of Rijeka
BOJANA DALBELO-BAŠIĆ, University of Zagreb
Software defect prediction is an important decision support activity in software quality assurance. Its goal is to reduce verification costs by predicting the system modules that are more likely to contain defects, thus enabling more efficient allocation of resources in the verification process. The problem is that there is no widely applicable, well-performing prediction method. The main reason lies in the very nature of software datasets: their imbalance, complexity, and properties dependent on the application domain. In this paper we suggest a research strategy for studying the stability of performance using different machine learning methods over different levels of imbalance in software defect prediction datasets. We also provide a preliminary case study on a dataset from the NASA MDP open repository using multivariate binary logistic regression and forward and backward feature selection. Results indicate that the performance becomes unstable around 80% imbalance.</p>
          <p>Categories and Subject Descriptors: D.2.9 [Software Engineering]: Management—Software quality assurance (SQA)
Additional Key Words and Phrases: Software Defect Prediction, Data Imbalance, Feature Selection, Stability
1. INTRODUCTION
Software defect prediction is recognized as one of the most important ways to achieve software development efficiency. The majority of costs during software development are spent on software defect detection activities, but the ability of these activities to guarantee software reliability is still limited. The analyses performed by [Andersson and Runeson 2007; Fenton and Ohlsson 2000; Galinac Grbac et al. 2013], in the environment of large-scale industrial software with a high focus on reliability, show that faults are distributed within the system according to the Pareto principle. They show that the majority of faults are concentrated in just a small number of system modules, and that these modules do not make up a majority of the system size. This fact implies that software defect prediction would bring real benefits if a well-performing model were applied. The main motivating idea is that if we were able to predict the location of software faults within the system, then we could plan defect detection activities more efficiently. This means that we would be able to concentrate defect detection activities and resources on critical locations within the system rather than on the entire system.</p>
          <p>Numerous studies have already been performed aiming to find the best general software defect prediction model [Hall et al. 2012]. Unfortunately, a well-performing solution is still absent. Data in software defect prediction are very complex and in general do not follow any particular probability distribution that could provide a mathematical model. Data distributions are highly skewed, which is connected to the well-known data imbalance problem, thus making standard machine learning approaches inadequate. Therefore, significant research has recently been devoted to coping with this problem.
Author's address: T. Galinac Grbac, Faculty of Engineering, Vukovarska 58, HR-51000 Rijeka, Croatia; email: tgalinac@riteh.hr; G. Mauša, Faculty of Engineering, Vukovarska 58, HR-51000 Rijeka, Croatia; email: gmausa@riteh.hr; B. Dalbelo-Bašić, Faculty of Electrical Engineering and Computing, Unska 3, HR-10000 Zagreb, Croatia; email: bojana.dalbelo@fer.hr.
Copyright © by the paper's authors. Copying permitted only for private and academic purposes.</p>
          <p>In: Z. Budimac (ed.): Proceedings of the 2nd Workshop on Software Quality Analysis, Monitoring, Improvement, and Applications (SQAMIA), Novi Sad, Serbia, 15.-17.9.2013, published at http://ceur-ws.org
Several solutions have been offered for the data imbalance problem. However, these solutions are not equally effective in all application domains. Moreover, there is still an open question regarding the extent to which imbalanced learning methods help with learning capabilities. This question should be answered with extensive and rigorous experimentation across all application domains, including software defect prediction, aiming to explore the underlying effects that would lead to fundamental understanding [He and Garcia 2009].</p>
          <p>The work presented in this paper is a step in that direction. We present a research strategy that aims to explore the performance stability of software defect prediction models in relation to levels of data imbalance. As an illustrative example we present an experiment undertaken following our strategy. We observed how learning performance, with and without stepwise feature selection, in the case of a logistic regression learner, changes over a range of imbalances in the context of software defect prediction. The findings are only indicative and are to be explored through exhaustive experimentation aligned with the proposed strategy.
1.1 Complexity of software defect prediction data
Software defect prediction (SDP) is concerned with early prediction of the system modules (file, class, module, method, component, or something else) that are likely to have a critical number of faults (above a certain threshold value, THR). Numerous studies have identified that these modules are not common. In fact, they are special cases, and that is why they are harder to find. The dependent variable in learning models is usually a binary variable with two classes, labeled as 'fault-prone' (FP) and 'not-fault-prone' (NFP). The number of FP modules is usually much lower than the number of NFP modules, so the FP modules represent a minority class and the NFP modules a majority class. Datasets with significantly unequal distributions of the minority and majority classes are imbalanced. The independent variables used in SDP studies are numerous. In this paper we address SDP based on static code metrics [McCabe 1976].</p>
          <p>In SDP datasets the level of class imbalance varies across software application domains. We reviewed the software engineering publications dealing with software defect prediction and noticed that the percentage of non-fault-prone modules (%NFP) in the datasets varies widely (from 1% in a medical record system [Andrews and Stringfellow 2001] to more than 94% in a telecom system [Khoshgoftaar and Seliya 2004]) for various software application domains (telecom industry, aeronautics, radar systems, etc.). Since there are SDP initiatives on datasets with a whole range of imbalance percentages, we are motivated to determine the percentage at which data imbalance becomes a problem, i.e., at which learners become unstable.</p>
          <p>As already mentioned above, the random variables measured in software engineering usually do not follow any particular distribution, and the applicability of classical mathematical modeling methods and techniques is limited. Hence, machine learning algorithms have been widely adopted. Among the various learning methods used in defect prediction approaches, this paper explores the capabilities of multivariate binary logistic regression (LR). Our ultimate goal is not to validate different learning algorithms but to explore learning performance stability over different levels of imbalance. LR has shown very good performance in the past and is known to be a simple but robust method. In [Lessmann et al. 2008] it is the 9th best classifier among 22 examined (9/22) and at the same time the 2nd best statistical classifier among the 7 of them (2/7). The stepwise regression classifier was the most accurate classifier (1/4) and was outperformed only in cases with many outliers in [Shepperd and Kadoda 2001]. Very good performance of logistic regression was also observed in [Kaur and Kaur 2012] (3/12 in terms of accuracy and AUC), [Banthia and Gupta 2012] (1/5 both with and without preprocessing of 5 raw NASA datasets), [Giger et al. 2011] (1/8 in terms of median AUC over 15 open source projects), [Jiang et al. 2008] (2/6 in terms of AUC and 3/6 according to the Nemenyi post-hoc test), etc. However, none of these studies analyzed the performance of the logistic regression classifier in relation to data imbalance. The study [Provost 2000] argues that in the majority of published work the performance of the logistic learner would be significantly improved if it were used adequately. We will refer to this issue in more detail in Section 3.</p>
          <p>As in the whole software engineering field, an important problem in software defect prediction is the lack of quality industrial data, and therefore the generalization ability and further propagation of research results are very limited. The problem is usually that these data are considered confidential by industry, or that the data are not available at all for industry with low maturity. To overcome these obstacles, there are initiatives for open repositories of datasets, aligned with the goal of improving the generalization of research results. However, the problem of generalization still remains, because the open repositories usually contain data from a particular type of software (e.g. the NASA MDP repository, open source software repositories, etc.) and/or of questionable quality [Gray et al. 2011].</p>
          <p>In this study we used NASA MDP datasets and carefully addressed all the potential issues, i.e., removed duplicates [Gray et al. 2012]. This selection is motivated by simple comparison of results with the related work, so that our contribution can be easily incorporated into the existing knowledge base of the imbalance problem in the SDP area.
1.2 Experimental approach
Our goal is to explore the stability of evaluation metrics when learning SDP datasets with machine learning techniques across different levels of imbalance. Moreover, we want to evaluate potential sources of bias in the study design by constructing a number of experiments in which we vary one parameter per experiment. The parameters that are subject to change are explained briefly in Section 2.</p>
          <p>To integrate the conclusions obtained from each experiment, a meta-analytic statistical analysis is proposed. Such methods are suggested by a number of authors as a tool for generalizing results and integrating knowledge across many studies [Brooks 1997]. We propose the following steps:
(1) Acquiring data. A sample S of independent random variables X1, ..., Xn measuring different features of a system module, and a binary dependent variable Y measuring fault-proneness (with Y = 1 for FP modules and Y = 0 for NFP modules), is obtained from a repository (e.g. an open repository, open source projects, industrial projects).
(2) Data preprocessing.</p>
          <p>(a) Data cleaning, noise elimination, sampling.
(b) Data multiplication. From the sample S obtained in step (1), a training set of size 2/3 the size of S and a validation set of size 1/3 the size of S are chosen at random k times. In this way k training samples T1, ..., Tk and k validation samples V1, ..., Vk are obtained. These samples are categorized into ℓ categories with respect to the data imbalance, defined as the percentage of NFP modules in Ti and calculated as %NFP(Ti) = NFP(Ti) / (FP(Ti) + NFP(Ti)).
(c) Feature selection. For each training set Ti a feature selection is performed. As a result, some of the random variables Xj are excluded from the model. The inclusion/exclusion frequencies of the Xj for each of the categories introduced in step (2b) are recorded.
(3) Learning.</p>
          <p>(a) Building a learning model. A learning model is built for each training set Ti using the learning techniques under consideration.
(b) Evaluating model performance. Using the validation set Vi, the model built in step (3a) is evaluated using various evaluation metrics. Let M be the random variable measuring the value of one of these metrics.
(4) Statistical analysis.
(a) Variation analysis. The differences between the ℓ samples of a random variable M obtained from samples Ti and Vi belonging to the different categories introduced in step (2b) are analyzed using statistical tests. This step is repeated for each evaluation metric used in step (3b).
(b) Cross-dataset validation. The whole process is repeated from step (1) for m datasets from various application domains and sources. The differences between the ℓ·m samples of a random variable M are analyzed using statistical tests, and the results reveal whether general behavior exists.</p>
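The data multiplication and categorization of step (2b) can be sketched as follows. This is a minimal Python sketch, not the authors' tooling: the helper names, the random seed, and the category bounds (taken from the 51%-96% range of the case study in Section 3) are illustrative assumptions.

```python
import random

def nfp_percentage(labels):
    """%NFP = NFP / (FP + NFP): share of not-fault-prone (label 0) modules."""
    nfp = sum(1 for y in labels if y == 0)
    return 100.0 * nfp / len(labels)

def multiply_data(sample, k, seed=0):
    """Step (2b): draw k random 2/3 training / 1/3 validation splits of
    `sample`, a list of (features, label) pairs."""
    rng = random.Random(seed)
    splits = []
    for _ in range(k):
        shuffled = sample[:]
        rng.shuffle(shuffled)
        cut = 2 * len(shuffled) // 3
        splits.append((shuffled[:cut], shuffled[cut:]))
    return splits

def imbalance_category(pct_nfp, low=51.0, high=96.0, n_cats=5):
    """Assign a %NFP value to one of n_cats equal-width imbalance categories
    over [low, high] (bounds here mirror the preliminary case study)."""
    width = (high - low) / n_cats
    idx = int((pct_nfp - low) // width)
    return min(max(idx, 0), n_cats - 1)
```

Each training sample Ti produced by `multiply_data` would then be placed in the category returned by `imbalance_category(nfp_percentage(...))` before feature selection and learning.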
          <p>To summarize, the conclusions are based on the results of statistical tests comparing the mean values of performance evaluation metrics (see Table I) across different data imbalances of a training sample. The stability of the performance evaluation metrics obtained with different feature selection procedures is evaluated in the same way.
2. DATA IMBALANCE
Data imbalance has received considerable attention within the data mining community during the last decade. It has become a central topic of research, since the problem is present in a majority of data mining application areas [Weiss 2004]. In general, data imbalance degrades learning performance. The problem arises with the learning accuracy of the minority class, in which we are usually more interested: we typically want to predict, in a timely manner, rare events represented by the minority class, whose probability of occurrence is low but whose occurrence leads to significant costs.</p>
          <p>For example, suppose that only a very small number of system modules is faulty, which is the case in systems with very low tolerance for failures (e.g. medical systems, aeronautic systems, telecommunications, etc.). Suppose that we did not identify a faulty module with the help of a software defect prediction algorithm, and because of that developed a defect detection strategy that does not concentrate on that particular module. We thus fail to identify a fault in our defect detection activity, and this fault slips to the customer site. A failure caused by this fault at the customer site would then imply significant costs comprising several items: paying penalties to the customer, losing customer confidence, additional expenses due to corrective maintenance, additional costs in all subsequent system revisions, and additional costs during system evolution. This cost would be considered the misclassification cost of a wrongly classified positive class (note that the positive class in the context of a defect prediction algorithm is a faulty module). On the other hand, the misclassification cost of a wrongly classified negative class would be much lower, because it would involve just more defect detection activities. Obviously, the misclassification costs are unequally weighted, and this is the main obstacle to applying standard machine learning algorithms, because they usually assume the same or similar conditions in the learning and application environments [Provost 2000].</p>
          <p>The study [Provost 2000] surveys data imbalance problems and methods addressing them. Although different methods are recommended for data imbalance problems, it does not give definite answers regarding their applicability in a given application context. Some answers were obtained by other researchers in the field afterwards, and a more recent survey is given in [He and Garcia 2009]. Still, no definite guideline exists that could guide practitioners.
2.1 Dataset considerations
The most popular approach to the class imbalance problem is the use of artificially balanced datasets. Several sampling methods have been proposed for that purpose. In a recent work [Wang and Yao 2013] an experiment with some of the sampling methods is conducted. However, it is concluded in [Kamei et al. 2007] that sampling did not succeed in improving performance with all classifiers. In [Hulse et al. 2007] it is identified that classifier performance is improved by sampling, but individual learners respond differently to sampling.</p>
          <p>Another problem is that in practice the datasets are often very complex, involving a number of issues such as overlapping, lack of representative data, within- and between-class imbalance, and often high dimensionality. The effects of these issues have been widely analyzed separately (sample size in [Raudys and Jain 1991], dimensionality reduction in [Liu and Yu 2005], noise elimination in [Khoshgoftaar et al. 2005]), but not in conjunction with data imbalance. The study performed in [Batista et al. 2004] observes that the problem is related to a combination of absolute imbalance and other complicating factors. Thus, the imbalance problem is just an additional issue in complex datasets such as datasets for software defect prediction.</p>
          <p>Different aspects of feature selection in relation to class imbalance have been studied in [Khoshgoftaar et al. 2010; Gao and Khoshgoftaar 2011; Wang et al. 2012]. All these studies were performed on datasets from the NASA MDP repository. In this work we also used stepwise feature selection as a preprocessing step, because the dataset is high-dimensional and we experiment with logistic regression. Hence, we were able to investigate the stability of the performance with and without the feature selection procedure over different levels of imbalance.</p>
          <p>Besides the methods explained above for obtaining artificially balanced datasets, another approach is to adapt standard machine learning algorithms to operate on imbalanced datasets. In that case the learning approach should be adjusted to the imbalanced situation. A complete review of such approaches and methods can be found in [He and Garcia 2009].
2.2 Evaluation metrics
Another problem of standard machine learning algorithms for imbalanced data is the use of inadequate evaluation metrics, either during the learning procedure or for evaluating the final result. Evaluation metrics are usually derived from the confusion matrix and are given in Table I. They are defined in terms of the following score values. A true positive (TP) score is counted for every correctly (true) classified fault-prone module, and a true negative (TN) score for every correctly (true) classified non-fault-prone module. The other two possibilities are related to false prediction. A false positive (FP) score is counted for every falsely classified, or misclassified, non-fault-prone module (often referred to as a Type II error), and a false negative (FN) score is counted for every falsely classified, or misclassified, fault-prone module (often referred to as a Type I error) [Runeson et al. 2001; Khoshgoftaar and Seliya 2004]. For example, classification accuracy (ACC), the most commonly used evaluation metric in standard machine learning algorithms, is not able to value the minority class appropriately, and leads to poor classification performance on the minority class.</p>
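The score values above determine the metrics of Table I directly. A minimal Python sketch (the helper names are illustrative, not part of the study's tooling), with 1 denoting fault-prone and 0 not-fault-prone:

```python
def confusion_counts(predicted, actual):
    """Count TP, TN, FP, FN with 1 = fault-prone, 0 = not-fault-prone."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    return tp, tn, fp, fn

def table_i_metrics(tp, tn, fp, fn):
    """Table I metrics: accuracy, true positive rate (recall), precision."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # completeness
    pr = tp / (tp + fp) if tp + fp else 0.0   # exactness
    return acc, tpr, pr
```

On a highly imbalanced validation sample, `acc` can stay high even when `tpr` on the minority (fault-prone) class collapses, which is exactly the inadequacy described above.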
          <p>In the case of class imbalance, the precision (PR) and recall (TPR) metrics given in Table I are recommended in a number of studies [He and Garcia 2009], as are the F-measure and G-mean, which are not used here. Precision and recall in combination give a measure of correctly classified fault-prone modules. Precision measures exactness, i.e., how many of the modules classified as fault-prone really are fault-prone, while recall measures completeness, i.e., how many of the truly fault-prone modules are classified correctly.</p>
          <p>Table I. Evaluation metrics derived from the confusion matrix
Accuracy (ACC) = (TP + TN) / (TP + TN + FP + FN)
True positive rate (TPR; sensitivity, recall) = TP / (TP + FN)
Precision (PR; positive predicted value) = TP / (TP + FP)</p>
          <p>The output of a probabilistic machine learning classifier is the probability that a module is fault-prone. Therefore, a cutoff percentage has to be defined in order to perform classification. Since choosing a cutoff value leaves room for bias and possible inconsistencies in a study [Lessmann et al. 2008], there is another measure that deals with that problem, called the area under the curve, AUC [Fawcett 2006]. It takes into account the dependence of the TPR, and of a similar metric for the false positive proportion, on the cutoff value.</p>
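AUC can be computed without sweeping cutoffs explicitly, via its rank (Mann-Whitney) formulation: it equals the probability that a randomly chosen fault-prone module receives a higher score than a randomly chosen non-fault-prone one. A minimal Python sketch (the function name is illustrative; an O(n log n) rank-based version would be used in practice):

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative; ties count as 1/2. Labels: 1 = fault-prone, 0 = not."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 1.0 means every fault-prone module is ranked above every non-fault-prone one; 0.5 corresponds to random ranking, independent of any cutoff choice.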
          <p>None of the aforementioned metrics are cost-sensitive, and for rare events with a very high misclassification cost of Type I errors the key performance indicator is cost. The most favorable evaluation criteria for imbalanced datasets are cost curves, which are also recommended in [Jiang et al. 2008] for the SDP domain.
3. PRELIMINARY CASE STUDY
To illustrate the application of the research strategy proposed in Section 1.2, verify the strategy, provide evidence for the dependence of machine learning performance on the level of data imbalance, and indicate our future goals, we have undertaken a preliminary case study.
(1) Dataset KC1 from the NASA MDP repository was acquired. It consists of n = 29 features, i.e., independent variables Xj. The dependent variable in this dataset is the number of faults in a system module. From this variable we derived a binary dependent variable Y by setting ten different thresholds for fault-proneness, from 1 to 19 with a step of 2 (1, 3, 5, ...). In this way we obtained ten different samples S, and we continued the analysis for all of them.
(2) (a) The well-known issues with the dataset were eliminated using a data cleaning tool [Shepperd et al. 2013].</p>
          <p>
(b) For each of the ten samples obtained in step (1), we made 50 iterations of the random splitting
into training and validation samples. Thus we obtained k = 500 samples Ti and Vi with the
range of data imbalance from 51% to 96%. The samples are categorized into ℓ = 5 categories of
equal length (each spanning 9%).
(c) In the case study we also consider the influence of a feature selection procedure, as already
mentioned in Section 2. We consider the forward and backward stepwise selection procedures [Han and
Kamber 2006]. The decision on inclusion or exclusion of a feature is based on the level of
statistical significance, the p-value. The common significance levels for inclusion and exclusion of
features are used, as in [Mausa et al. 2012; Briand et al. 2000], with p_in = 0.05 and p_out = 0.1,
respectively. The percentage of inclusion of a feature for both procedures and different
categories of data imbalance is given in Table II. We conclude that the feature selection stability of
some features is very tolerant to data imbalance (e.g., Features 5, 22, 28 and 29 are always excluded,
for both the forward and the backward model). Some features are very stable until a certain level of
imbalance (for example, Feature 2 is always included, 100%, until the category with data imbalance of
78%). It is also interesting to observe that some features have similar feature selection
stability in the ideally balanced case and in the highly imbalanced case, whereas for moderate imbalance
they have the opposite feature selection decision.
(3) (a) Learning models are built using multivariate binary logistic regression (LR) [Hastie et al.
2009]. The model incorporates more than one predictor variable and in the fault prediction
case performs according to the equation
π(X1, X2, ..., Xn) = e^(C0 + C1·X1 + ... + Cn·Xn) / (1 + e^(C0 + C1·X1 + ... + Cn·Xn)),   (1)
where Cj are the regression coefficients corresponding to Xj, and π is the probability that a
fault was found in a class during validation. In order to obtain a binary outgoing variable,
a cutoff value splits the results into two categories. Researchers often set the cutoff value to
0.5; robustness to imbalance can be achieved by setting the cutoff to an optimal value
dependent on misclassification costs [Basili et al. 1996]. Our goal is to explore learning performance
over different imbalance levels. However, in this study, due to space limitations, we provide
preliminary results exploring the learning performance stability of standard learning algorithms.
Therefore, we provide results of experiments with the cutoff value set to 0.5 (which is how standard
learning algorithms equally weight misclassification costs). We considered three
different models (with forward feature selection, with backward feature selection, and without feature
selection), and for each of these models the coefficients are calculated separately.
(b) For all validation samples from step (2b) we count the TN, TP, FN and FP scores of the
corresponding model, and calculate the learning performance evaluation metrics ACC, TPR (recall),
AUC and precision using the formulas in Table I.
(4) We performed a statistical analysis of the behavior of the evaluation metrics measured in step (3b) across
the categories introduced in step (2c). Since the samples are not normally distributed, we used
non-parametric tests. The Kruskal-Wallis test showed for all metrics that the values depend
on the category. To explore the differences further, we applied a multiple comparison test. It reveals
that all considered evaluation metrics become unstable at an imbalance level of 80%. According
to the theory explained in Section 2, we expect to obtain significantly different mean values
for all metrics in the category of highest data imbalance (90%–100%).
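Steps (1) and (3a) above can be sketched as follows (a hypothetical illustration with our own function names and example data, not the study's code): derive the binary dependent variable Y from module fault counts at a threshold, measure the resulting data imbalance, and classify with the logistic model of Eq. (1) at the standard 0.5 cutoff.

```python
import math

def binarize(fault_counts, threshold):
    """Y = 1 (fault-prone) if a module has at least `threshold` faults."""
    return [1 if f >= threshold else 0 for f in fault_counts]

def imbalance(labels):
    """Share of the majority class, in percent."""
    ones = sum(labels)
    return 100.0 * max(ones, len(labels) - ones) / len(labels)

def logistic_prob(x, coeffs, intercept):
    """pi(X1..Xn) = e^z / (1 + e^z), with z = C0 + C1*X1 + ... + Cn*Xn."""
    z = intercept + sum(c * xi for c, xi in zip(coeffs, x))
    return math.exp(z) / (1.0 + math.exp(z))

def classify(x, coeffs, intercept, cutoff=0.5):
    """Binary outcome obtained by applying the cutoff to pi."""
    return 1 if logistic_prob(x, coeffs, intercept) >= cutoff else 0

# Ten samples, one per threshold 1, 3, ..., 19 -- imbalance grows with
# the threshold, since fewer modules qualify as fault-prone.
faults = [0, 0, 0, 1, 1, 2, 3, 5, 8, 13]   # invented example counts
samples = {t: binarize(faults, t) for t in range(1, 20, 2)}
```

Raising the threshold shifts more modules into the majority (non-fault-prone) class, which is exactly the mechanism that produces the imbalance categories analyzed above.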
4. DISCUSSION
The data imbalance problem has been widely investigated, and numerous approaches have studied
its effects with the aim of proposing a general solution. However, from experiments in
machine learning theory it becomes obvious that the problem is not only related to the proportion of the minority to the
majority class; other influences are also present in complex datasets. As the datasets in the
software defect prediction (SDP) research area are usually extremely complex, there is a huge unexplored
area of research related to the applicability of these techniques in relation to the level of data imbalance.
That is exactly our main motivation for this work.</p>
          <p>There are many approaches to SDP and to the development of the learning model, depending on the
particular dataset. Since we are interested in the performance stability of machine learners over SDP datasets,
we should rigorously explore the strengths and limitations of these approaches in relation to the level
of data imbalance. Therefore, we present an exploratory research strategy and an example of a case
study performed according to this strategy. Although we designed our experiment to eliminate as many
inconsistencies and threats of applying the strategy as possible, there is still room for improvement.</p>
          <p>In our case study we show how performance stability is significantly degraded at a higher level of
imbalance. This confirms the results obtained by other researchers using different approaches, and that
conclusion supports the reliability of our strategy. Moreover, with the help of our research strategy we
confirmed that feature selection becomes unstable with higher data imbalance. We have also observed
that for some features the feature selection is consistent across levels of imbalance.</p>
          <p>Future work should involve extensive exploration of SDP datasets with the proposed strategy. Our
vision is that in the end we can gain deeper knowledge about imbalanced data in SDP and the applicability
of techniques at different levels of imbalance. Finally, we would like to categorize datasets using the
proposed strategy; the results of this exhaustive research would serve as a guideline for
practitioners when developing software defect prediction models.
D. Gray, D. Bowes, N. Davey, Y. Sun, and B. Christianson. The misuse of the NASA Metrics Data Program data sets for automated
software defect prediction. Processing, pages 96–103, 2011.</p>
          <p>D. Gray, D. Bowes, N. Davey, Y. Sun and B. Christianson. Reflections on the NASA MDP data sets. IET Software, pages
549–558, 2012.</p>
          <p>T. Galinac Grbac, P. Runeson, and D. Huljenic. A second replicated quantitative analysis of fault distributions in complex
software systems. IEEE Transactions on Software Engineering, 39(4):462–476, 2013.</p>
          <p>T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. A systematic literature review on fault prediction performance in
software engineering. Software Engineering, IEEE Transactions on, 38(6):1276–1304, 2012.</p>
          <p>J. Han and M. Kamber. Data mining: concepts and techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
2006.</p>
          <p>T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference and prediction. Springer,
2 edition, 2009.</p>
          <p>H. He and E. A. Garcia. Learning from Imbalanced Data. IEEE Trans. Knowledge and Data Engineering, 21(9):1263-1284,
2009.</p>
          <p>J. Van Hulse, T. Khoshgoftaar, A. Napolitano. Experimental perspectives on learning from imbalanced data. In Proc. 24th
International Conference on Machine Learning (ICML ’07), pages 935–942, 2007.</p>
          <p>Y. Jiang, B. Cukic, and Y. Ma. Techniques for evaluating fault prediction models. Empirical Softw. Engg., 13:561–595, October
2008.</p>
          <p>Y. Kamei, A. Monden, S. Matsumoto, T. Kakimoto, K. Matsumoto. The Effects of Over and Under Sampling on Fault-prone
Module Detection. In Proc. ESEM 2007, First International Symposium on Empirical Software Engineering and Measurement,
pages 196–201. IEEE Computer Society Press, 2007.</p>
          <p>I. Kaur and A. Kaur. Empirical study of software quality estimation. In Proceedings of the Second International Conference
on Computational Science, Engineering and Information Technology, CCSEIT ’12, pages 694–700, New York, NY, USA, 2012.</p>
          <p>ACM.</p>
          <p>T. M. Khoshgoftaar, E. B. Allen, R. Halstead, and G. P. Trio. Detection of fault-prone software modules during a spiral life cycle.</p>
          <p>In Proceedings of the 1996 International Conference on Software Maintenance, ICSM ’96, pages 69–76, Washington, DC, USA,
1996. IEEE Computer Society.</p>
          <p>T. M. Khoshgoftaar and N. Seliya. Comparative assessment of software quality classification techniques: An empirical case
study. Empirical Softw. Engg., 9(3):229–257, Sept. 2004.</p>
          <p>T. M. Khoshgoftaar, N. Seliya, K. Gao. Detecting noisy instances with the rule-based classification model. Intell. Data Anal.,
9(4):347–364, 2005.</p>
          <p>T. M. Khoshgoftaar, K. Gao, N. Seliya. Attribute Selection and Imbalanced Data: Problems in Software Defect Prediction. In</p>
          <p>Proceedings of the 22nd IEEE International Conference on Tools with Artificial Intelligence, pages 137–144, 2010.</p>
          <p>H. Liu, L. Yu. Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Trans. on Knowl. and</p>
          <p>Data Eng., 17(4):491–502, 2005.</p>
          <p>S. Lessmann, B. Baesens, C. Mues, and S. Pietsch. Benchmarking classification models for software defect prediction: a proposed
framework and novel findings. IEEE Transactions on Software Engineering, 34(4):485–496, 2008.</p>
          <p>G. Mausa, T. Galinac Grbac, and B. Basic. Multivariate logistic regression prediction of fault-proneness in software modules. In</p>
          <p>MIPRO, 2012 Proceedings of the 35th International Convention, pages 698–703, 2012.</p>
          <p>T.J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2:308–320, 1976.</p>
          <p>N. Ohlsson, M. Zhao, and M. Helander. Application of multivariate analysis for software fault prediction. Software Quality</p>
          <p>Control, 7:51–66, May 1998.</p>
          <p>F. Provost. Machine Learning from Imbalanced Data Sets 101. In Proc. Learning from Imbalanced Data Sets: Papers from the</p>
          <p>Am. Assoc. for Artificial Intelligence Workshop, Technical Report WS-00-05, 2000.</p>
          <p>S. J. Raudys, A. K. Jain. Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners. IEEE</p>
          <p>Trans. Pattern Anal. Mach. Intell., 13(3):252–264, May 1991.</p>
          <p>P. Runeson, M. C. Ohlsson, and C. Wohlin. A classification scheme for studies on fault-prone components. In Proceedings of the
Third International Conference on Product Focused Software Process Improvement, PROFES ’01, pages 341–355, London, UK,
2001. Springer-Verlag.</p>
          <p>M. Shepperd and G. Kadoda. Comparing software prediction techniques using simulation. IEEE Trans. Softw. Eng., 27(11):1014–
1022, Nov. 2001.</p>
          <p>M. Shepperd, Q. Song, Z. Sun, C. Mair Data Quality: Some Comments on the NASA Software Defect Data Sets. IEEE Trans.</p>
          <p>Softw. Eng., http://doi.ieeecomputersociety.org/10.1109/TSE.2013.11, Nov. 2013.
H. Wang, T. M. Khoshgoftaar, and A. Napolitano. An Empirical Study on the Stability of Feature Selection for Imbalanced
Software Engineering Data. In Proceedings of the 2012 11th International Conference on Machine Learning and Applications
- Volume 01, ICMLA ’12, pages 317–323, Washington, DC, USA, 2012.</p>
          <p>S. Wang and X. Yao. Using Class Imbalance Learning for Software Defect Prediction. IEEE Transactions on Reliability,
62(2):434–443, 2012.</p>
          <p>G.M. Weiss. Mining with rarity: a unifying framework. In SIGKDD Explor. Newsl., 6(1):7–19, 2004.</p>
          <p>T. Zimmermann and N. Nagappan. Predicting defects using network analysis on dependency graphs. In Proceedings of the 30th
international conference on Software engineering, ICSE ’08, pages 531–540, New York, NY, USA, 2008. ACM.
Enhancing Software Quality in Students’ Programs
STELIOS XINOGALOS, University of Macedonia
MIRJANA IVANOVIĆ, University of Novi Sad
This paper focuses on enhancing software quality in students’ programs. To this end, related work is reviewed and proposals for
applying pedagogical software metrics in programming courses are presented. Specifically, we present the main advantages and
disadvantages of using pedagogical software metrics, as well as some proposals for utilizing features already built in contemporary
programming environments for familiarizing students with various software quality issues. Initial experiences with the usage of software
metrics in teaching programming courses and concluding remarks are also presented.</p>
          <p>Categories and Subject Descriptors: D.2.8 [Software Engineering]: Metrics – Complexity measures; K.3.2 [Computers and
Education]: Computer and Information Science Education – Computer science education
General Terms: Education, Measurement
Additional Key Words and Phrases: Pedagogical Software Metrics, Quality of Students’ Software Solutions, Assessments of
Students’ Programs
1. INTRODUCTION
Teaching and learning programming presents teachers and students, respectively, with several challenges.
Students have to comprehend the basic algorithmic/programming constructs and concepts, acquire
problem-solving skills, learn the syntax of at least one programming language, and familiarize themselves with the
programming environment and the whole program development process. Moreover, students nowadays
have to become familiar with imperative and object-oriented programming techniques and utilize them
appropriately. The numerous difficulties encountered by students regarding these issues have been
recorded in the extensive relevant literature. Considering time restrictions, large classes and increasing
dropout rates, adding further important software development aspects, such as software quality, to introductory
programming courses seems to be a difficult mission.</p>
          <p>On the other hand, several empirical studies regarding the development of real-world software systems
have shown that 40% to 60% of the development resources are spent on testing, debugging and
maintenance issues. It is clear both to the software industry and to those teaching programming that
students should be educated to write code of better quality. Several efforts have been made by
researchers and teachers towards achieving this goal. These efforts focus mainly on:
− adjusting widely accepted software quality metrics for use in a pedagogical context,
− devising special tools that carry out static code analysis of students’ programs.</p>
          <p>This paper focuses on studying the related work and making some proposals for dealing with software
quality in students’ programs. Specifically, we propose utilizing features already built in contemporary
programming environments used in our courses, for presenting and familiarizing students with various
software quality issues without extra cost. Of course, using pedagogical software metrics is not an issue
that concerns solely pure programming courses. However, since students form their programming
style in the context of introductory programming courses, it is important to introduce pedagogical
software metrics in such courses and then extend them to other software engineering, information systems and
database courses, or generally to courses that require students to develop software. The rest of the
This work was partially supported by the Serbian Ministry of Education, Science and Technological Development through project
Intelligent Techniques and Their Integration into Wide-Spectrum Decision Support, no. OI174023 and by the Provincial Secretariat
for Science and Technological Development through multilateral project “Technology Enhanced Learning and Software
Engineering”.</p>
          <p>Authors’ addresses: S. Xinogalos, Department of Applied Informatics, School of Information Sciences, 156 Egnatia str., 54006
Thessaloniki, Greece, email: stelios@uom.gr; M. Ivanović, Department of Mathematics and Informatics, Faculty of Sciences, Trg
Dositeja Obradovića 4, 21000 Novi Sad, Serbia, email: mira@dmi.uns.ac.rs
paper is organized as follows. Section 2 discusses related work. Section 3 considers the usage of
pedagogical software metrics. In Section 4 we present some initial experiences with the usage of
software metrics in teaching programming courses. The last section brings concluding remarks.
2. RELATED WORK
When we refer to commercial software quality, a long list of software metrics exists that includes basic
metrics and more elaborate ones, as well as combinations and variations of them. Highly referenced
basic metrics are: (i) the Halstead metric, which is used mainly for estimating the programming effort of a
software system in terms of the operators and operands used; and (ii) the McCabe cyclomatic complexity
measure, which analyzes the number of different execution paths in the system in order to decide how
complex, modular and maintainable it is.</p>
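The idea behind the McCabe measure can be sketched in a few lines (a deliberately simplified, language-agnostic approximation of our own, not a real metrics tool): complexity is estimated as one plus the number of branching keywords found in the source text, whereas real tools count decision nodes in the control-flow graph.

```python
import re

# Crude cyclomatic-complexity proxy: 1 + number of decision points,
# detected as branching keywords in the raw source text. The keyword
# list is an assumption chosen for this illustration.
DECISION_KEYWORDS = re.compile(r"\b(if|elif|for|while|case|catch|and|or)\b")

def cyclomatic_estimate(source):
    """Rough estimate of the number of independent execution paths."""
    return 1 + len(DECISION_KEYWORDS.findall(source))
```

Straight-line code scores 1; every branch or loop adds one, which is why deeply branching student programs score high on this measure.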
          <p>The problem with such metrics is that they have not been developed for use in a pedagogical context.
As [Patton and McGill 2006] state such metrics have the potential to be utilized for analysis of students’
programs, but they have specific shortcomings: several metrics emphasize the length of the code
irrespective of its logic and do not differentiate between various uses of language features, such as a for
versus a while loop, or a switch-case versus a sequence of if statements. When we talk about students’
programs it is clear that as educators we consider the logic of a program more important than its length,
while the appropriate utilization of language features is one of the main goals of introductory
programming courses. In this sense, researchers have proposed software metrics specifically for analyzing
student produced software.</p>
          <p>One such framework has been proposed by [Patton and McGill, 2006] and includes the following
elements: [1] language vocabulary: use of targeted language constructs and elements (e1); [2]
orthogonality/encapsulation: of both tasks (e2) and data (e3); [3] decomposition/modularization: avoiding
duplicates of code (e4) and overly nested constructs (e5); [4] indirection and abstraction (e6); [5]
polymorphism, inheritance and operator overloading (e7).</p>
          <p>Patton and McGill [2006] devised this framework in the context of a study regarding optimal use of
students’ software portfolios and propose attributing its elements to specific pedagogical objectives, and
weighting them according to the desired outcomes of the institution and instructor.</p>
          <p>Another recent study aimed at devising a list of metrics for measuring static quality of student code
and at the same time utilizing it for measuring quality of code between first and second year students. In
this study, seven code characteristics (in italics) that should be present in students’ code are analyzed in
22 properties, as follows [Breuker et al. 2011]: [1] size-balanced: (p1) number of classes in a package; (p2)
number of methods in a class; (p3) number of lines of code in a class; (p4) number of lines of code in a
method, [2] readable: (p5) percentage of blank lines; (p6) percentage of (too) long lines, [3] understandable:
(p7) percentage of comment lines; (p8) usage of multiple languages in identifier naming; (p9) percentage of
short identifiers, [4] structure: (p10) maximum depth of inheritance; (p11) percentage of static variables;
(p12) percentage of static methods; (p13) percentage of non-private attributes in a class, [5] complexity:
(p14) maximum cyclomatic complexity at method level; (p15) maximum level of statement nesting at
method level, [6] code duplicates: (p16) number of code duplicates; (p17) maximum size of code duplicates,
[7] ill-formed statements: (p18) number of assignments in an ‘if’ or ‘while’ condition; (p19) number of
‘switch’ statements without ‘default’; (p20) number of ‘breaks’ outside a ‘switch’ statement; (p21) number
of methods with multiple ‘returns’; (p22) number of hard-coded constants in expressions.</p>
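A few of the readability-oriented properties above are straightforward to compute; the sketch below is our own illustration (not Breuker et al.'s tool) on Java-like source text, covering the percentage of blank lines (p5), of overly long lines (p6, assuming an 80-character limit), and of comment lines (p7, taken here as lines starting with //, /* or *).

```python
def static_properties(source, max_len=80):
    """Compute p5, p6 and p7 (as percentages) for a chunk of source text."""
    lines = source.splitlines()
    n = len(lines) or 1  # avoid division by zero on empty input
    blank = sum(1 for l in lines if not l.strip())
    too_long = sum(1 for l in lines if len(l) > max_len)
    comment = sum(1 for l in lines if l.strip().startswith(("//", "/*", "*")))
    return {
        "p5_blank_pct": 100.0 * blank / n,
        "p6_long_pct": 100.0 * too_long / n,
        "p7_comment_pct": 100.0 * comment / n,
    }
```

Properties such as p14 (cyclomatic complexity) or p16 (code duplicates) require parsing rather than line counting, which is why dedicated tools are used for them.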
          <p>Some researchers have moved a step forward and have developed special tools that perform static
analysis of students’ code. Two characteristic examples are CAP [Schorsch 1995] and Expresso [Hristova
et al. 2003]. CAP (“Code Analyzer for Pascal”) analyzes programs that use a subset of Pascal and provides
user-friendly and informative feedback for syntax, logic and style errors, while Expresso aims to assist
novices writing Java programs in fixing syntax, semantic and logic errors, as well as contributing in
acquiring better programming skills.</p>
          <p>Several other tools have been developed with the aim of automatic grading of students’ programs in
order to provide them with immediate feedback, reducing the workload for instructors and also detecting
plagiarism [Pribela et al. 2008, Truong et al. 2004]. However, in most cases these environments are
targeted to specific languages, such as CAP for Pascal and Expresso for Java. A platform and
programming language independent approach is presented in [Pribela et al. 2012]. Specifically, the usage
of software metrics in automated assessment is studied using two language independent tools: SMILE for
calculating software metrics and Testovid for automated assessment.</p>
          <p>However, none of these solutions has gained widespread acceptance. Our proposal is to utilize features
of contemporary programming environments and tools in order to teach and familiarize students with
important aspects of software quality, as well as help them acquire better programming habits and skills
without extra cost. Usually features of this kind are not utilized appropriately, although they provide the
chance to help students increase the quality of their programs easily.
3. USING PEDAGOGICAL SOFTWARE METRICS
3.1 Advantages and Disadvantages
Pedagogical software metrics can be applied in various ways in courses having a software aspect, with
the ultimate goal of developing better quality software. Specifically, they can be given to students just as
guidelines to follow in order to develop quality code, or as factors that count towards the grading of their
software products. In the latter case it is clear that a considerable amount of time should be devoted to
training students to comprehend and apply the selected software metrics. On the one hand, it is
important for this to take place even in introductory programming courses, since this is the time when
students form their “good” or “bad” programming style/habits, which are not easy to change in the
future. On the other hand, novices have several difficulties to deal with when introduced to programming,
and adding formal rules regarding software metrics might not be a good choice, at least not for all
students. Moreover, adding more material to introductory programming courses is not easy in terms of
both time and volume of material.</p>
          <p>Several researchers and instructors have integrated software metrics in systems used for automatic
checking of software developed by students, either for grading their programs or/and for detecting
plagiarism. The advantages are several. First of all, students can get immediate feedback about their
achievements and be supported in overcoming their difficulties and misconceptions, while grading is fair.
Secondly, instructors save a great deal of time from correcting programs, a process that in the case of
large classes and many practical exercises is extremely time-consuming. Of course, developing such tools
is also not easy and requires a great deal of time and effort.
3.2 The Educational IDE BlueJ
The educational programming environment BlueJ is a widely known environment used in introductory
object oriented programming courses, since it offers several pedagogical features that assist novices.
These features can be appropriately utilized for teaching and familiarizing students with software quality
aspects described in the previous section and helping them acquire better programming habits.</p>
          <p>Editor features. The editor of BlueJ provides some features that can help students firstly appreciate
a good style of programming and secondly inspect their code for the existence of properties proposed in the
framework by [Breuker et al., 2011] or the elements proposed by [Patton and McGill, 2006], or other
similar frameworks. These features are:
− line numbers that can be used for a quick look on the lines of code in methods (p4) and classes (p3)
if the instructor considers it important and provides students with relevant measures for a project
− auto-indenting and bracket matching help students write code that is better structured and more
readable. However, students often do not consider style very important and write
endless lines of code with no indentation and no distinction between blocks of code. In the case of
errors, which are so common in students’ programs, this lack of structure makes the detection of
errors difficult, especially in the case of nested constructs (e5). The instructor can easily convey
this concept by presenting such a program to students (or using their own) and
using the automatic-layout ability provided by BlueJ to show them the corresponding
program with proper indentation, in order to help them realize the difference in practice.
− syntax-highlighting can help students easily inspect their code for ill-formed statements (p18–p21).</p>
          <p>However, the instructor has to make students comprehend that they have to inspect the code they
write and not just compile and run it. Syntax-highlighting, for example, can help students easily
detect a sequence of ‘if’ statements that should be replaced by an ‘if/else if’ or ‘switch’ construct.
− scope highlighting, presented with the use of different background colors for blocks of code,
should be – in the same sense as above – utilized for a quick inspection of nested constructs (e5)
and the level of statement nesting at method level (p15) in order to avoid increased complexity. The
instructor can give students some maximum values to keep in mind and alert them to
reconsider the decomposition/modularization of their solution.
− method comments can be easily added to students’ code. When the cursor is in the context of a
method and the student invokes the ‘method comment’ choice, a template of a method comment is
added to the source code containing the method’s name, Javadoc tags and basic information
regarding parameters, return types and so on. Students must understand that comments (p7)
produce more readable and maintainable code and can also be used for producing a more
comprehensible and valuable documentation view of a class. This interface of a class is important in
project teams and the development of real-world software systems.</p>
          <p>Moreover, if instructors think that a more formal approach should be adopted towards checking coding
styles the BlueJ CheckStyle extension [Giles and Edwards] can be used. This extension is a wrapper for
the CheckStyle (release v5.4) development tool and allows the specification of coding styles in an external
file.</p>
          <p>Incremental development and testing. Students tend to write large portions of code before they
compile and test it, thereby increasing the possibility of error-prone code of lower quality. We consider
it important to develop and test a program incrementally in order to achieve better quality code.
BlueJ offers some unique possibilities for novices towards this direction. Specifically, the ability of
creating objects and calling methods with direct manipulation techniques makes incremental development
and testing an easy process. Students are encouraged to create instances of a class (by right-clicking on it
from the simplified UML class diagram of a project presented in the main window of BlueJ) and call each
method they implement for testing its correctness. Students can even call a method by passing it - with
direct manipulation techniques - references to objects existing and depicted in the object bench. This
makes incremental development and testing of each method much easier and less time-consuming. The
invocation of methods should always be done with the object inspector of each object active, in order to
check how the object’s state is affected and also how it affects method execution. Students should be
encouraged to use the object inspector to check: encapsulation of data (e3); static variables (p11); private
and non-private attributes (p13). It is not unusual for students to write code mechanically, so it is
important for them to learn to inspect afterwards what they have written. This holds for
methods as well. The pop-up menu with the available methods for an object of a class shows explicitly the
public interface of a class and can help novices comprehend public and private access modifiers in practice
and utilize them appropriately. Also, the dialog box that appears when a student creates an object or calls
a method for an object, “asks” the student to enter a value of the appropriate type for each parameter and
helps students realize whether their choices of parameters were correct (i.e. a parameter is missing or it is
not needed). Students can experiment with all the aforementioned concepts by writing the corresponding
statements in the Code pad incorporated in the main window of BlueJ.</p>
          <p>Visualization of concepts. The main window of BlueJ presents students with a simplified UML
class diagram giving an overview of a project’s structure. Specifically, the following information is
presented: name of each class; type of class (concrete, abstract, interface, applet, enum); ‘uses’ and
‘inheritance’ relations. This UML class diagram can be used for getting an overview of a project, whether it is
given to students for study or developed by the students themselves. Students can easily inspect the
overall structure of a project, the number of classes (p1) and the depth of inheritance (p10). Students
should also be encouraged to inspect the UML class diagram in order to: detect classes representing
Redefining Software Quality Metrics to XML Schema
Needs
MAJA PUŠNIK, BOŠTJAN ŠUMAK AND MARJAN HERIČKO, University of Maribor
ZORAN BUDIMAC, University of Novi Sad
The structure and content of XML schemas, important and widely used document definitions, have a significant influence on the
quality of XML data and XML technologies in general; therefore the quality of XML schemas and an accurate assessment of that quality
are a fundamental research challenge in all fields of XML application. A good quality estimation of an XML schema can directly and
indirectly lead to higher efficiency of its usage, simplification of information solutions, efficient maintenance, and higher quality of
data and business processes. This paper addresses challenges in measuring the level of XML schema quality by employing general
software quality metrics; a set of holistically defined and document-oriented metrics is proposed. The proposed XML schema quality
metrics are based on existing software metrics, adapted according to the needs of XML schemas, addressing them mostly from a structural
perspective.</p>
          <p>
            Categories and Subject Descriptors: H.0. [Information Systems]: General; D.2.8 [Software Engineering]: Metrics — Complexity
measures; Product metrics; D.2.9. [Software Engineering]: Management — Software quality assurance (SQA)
General Terms: Software quality assurance
Additional Key Words and Phrases: software metrics, quality metrics, XML Schema
1. INTRODUCTION
The primary role of XML schemas is the definition of XML data and of supporting rules for the use of
XML data, an important part of information technologies. XML schemas and related technologies form
an important part of IT solutions in most Slovenian companies [
            <xref ref-type="bibr" rid="ref3">Sušnik 2008</xref>
            ], in the EU and worldwide [Rishel
2011]. The use of XML has spread from e-business and data exchange to data presentation at
various levels of contemporary information solution architectures: (1) web service interface definitions, (2)
data models, (3) specifications of business cooperation protocols between companies (their many
uses are evident from scientific and technical papers), etc. Due to this widespread use, the
question of XML schema quality is often open, particularly regarding the structure (and content) of
XML schemas, which indirectly influences the quality of the data that an XML schema describes. Measuring
XML schema quality is therefore the basic research challenge of this paper. A solution to the problem (a
composite of metrics) will directly or indirectly lead to greater efficiency in the use of XML schemas,
simplified IT solutions, easier maintenance, and improved quality of data and associated business
processes. Ideally the metrics should cover the structure, the content, and the domain in which an XML
schema is applied; this paper, however, focuses mostly on the structural aspect, trying to take advantage of
existing software metrics.
          </p>
          <p>
There have been several attempts to evaluate and measure XML schemas. Some of them are
            <xref ref-type="bibr" rid="ref3">summarized in
[Zhang 2008</xref>
            ]. Significantly related work was also done in [McDowell, Schmidt, Yue 2004] and
[
            <xref ref-type="bibr" rid="ref7">Narasimhan, Hendradjaya 2007</xref>
            ], where attempts were made to measure XML schemas as well as software in
general. The subject has been addressed in other papers not included in this overview; their
background, however, consists mainly of software metrics, which do not always meet the needs of XML schema
quality (and complexity) measurement.
          </p>
          <p>Based on surveys and interviews conducted within the University of Maribor and nearby companies,
XML schemas are often built irrationally, in a manner that merely satisfies the minimum requirements of
syntactic correctness and content sufficiency. Existing metrics only partially address the problem: they build
on solutions known from software engineering and do not address the problem of an objective
quality evaluation of an XML schema. The dynamic creation and adaptation of XML schemas
poses an additional research challenge that requires new approaches and solutions, both
universal and domain-specific.</p>
          <p>The aim of this paper is the definition of a new theoretical approach for evaluating the quality of XML
schemas, based on the original concept of semantically related analysis of XML schemas and XML
documents, using a new set of metrics. The design correctness of the newly redefined metrics was
confirmed on an extended set of test data of established XML schemas from the fields of e-business
and integration of complex business information systems. For quality measurement purposes we gathered
quality parameters addressing different aspects of XML schema needs and demands.</p>
          <p>
This paper is organized into four chapters. After the presentation of this paper's background and a
description of the included XML quality parameters, chapter two presents all aspects in metric types. Chapter
three presents the metric application and chapter four discusses our present work and future plans.
1.1 XML schema quality parameters
A systematic review of the literature on measuring XML schemas showed that
several metrics have been applied to XML schema evaluation, extracted mainly from software
engineering measurement methods and focusing mostly on the complexity of XML schemas. To include a variety of
parameters addressing complexity and quality, we searched different fields of quality measurement. The
first group of parameters is related to the structural characteristics of XML schemas
            <xref ref-type="bibr" rid="ref1">(we included a
survey in which all currently defined metrics are collected from several authors [Zhang 2008])</xref>
            :
- XML schema size,
- Number of XML nodes and annotations,
- Number of global and local element declarations,
- Number of global or local complex types definitions,
- Number of derived complex types, number of global and local definitions of simple types,
- Number of global or local definitions of models groups (groups),
- Number of global or local definitions of groups of attributes,
- Branch elements, the average cardinality of elements, etc.
          </p>
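<p>As an illustration, structural parameters of the kind listed above (global and local element declarations, type definitions, annotations) can be collected from a parsed schema with a short script. This is a sketch only: the example schema and the counting rules are our own illustration, not tooling from the paper.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

def q(tag):
    # Namespace-qualified tag name for the XML Schema namespace.
    return "{%s}%s" % (XS, tag)

# Build a tiny example schema programmatically (equivalent to an .xsd with
# one global element, one complex type containing two local elements,
# one simple type, and one annotation).
schema = ET.Element(q("schema"))
ET.SubElement(schema, q("annotation"))
ET.SubElement(schema, q("element"), name="person", type="PersonType")
ctype = ET.SubElement(schema, q("complexType"), name="PersonType")
seq = ET.SubElement(ctype, q("sequence"))
ET.SubElement(seq, q("element"), name="name", type="xs:string")
ET.SubElement(seq, q("element"), name="age", type="AgeType")
ET.SubElement(schema, q("simpleType"), name="AgeType")

def structural_counts(root):
    return {
        # global declarations are direct children of xs:schema
        "global_elements": len(root.findall(q("element"))),
        "global_complex_types": len(root.findall(q("complexType"))),
        "global_simple_types": len(root.findall(q("simpleType"))),
        # totals also include local declarations deeper in the tree
        "all_elements": len(root.findall(".//" + q("element"))),
        "annotations": len(root.findall(".//" + q("annotation"))),
    }
```

The same counting approach extends to the remaining parameters (groups, attribute groups, derived types) by querying the corresponding XML Schema elements.</p>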
          <p>Fig. 1 Quality hierarchy in XML schemas (levels: well structured, well connected, flexible and
extendable, expert revised, pleasant use)</p>
          <p>
The typical software metric parameters were extended with parameters from other quality
measurement fields, specifically taken from ISO standards
            <xref ref-type="bibr" rid="ref4 ref5">(ISO/IEC 9126 [McDowell, Schmidt, Yue
2004])</xref>
            , decision model theory [
            <xref ref-type="bibr" rid="ref6">Burris 2012</xref>
            ] and other papers [Zhang 2008]:
- XML schema functionality,
- XML schema simplicity,
- XML schema scalability,
- XML schema comprehensibility,
- XML schema re-use,
          </p>
          <p>- XML schema fullness,
- XML schema integrability,
- XML schema flexibility,
- XML schema implementation,
- XML schema maintenance,
- accuracy,
- validity,
- up-to-dateness,
- minimalism,
- consistency,
- portability,
- security,
- interoperability,
- reliability,
- effectiveness,
- visibility.</p>
          <p>To determine the quality levels of XML schema usage, we borrowed from Maslow’s hierarchy of
needs, which can be applied to software and to all supporting technologies; our interpretation is presented
in Fig. 1. The gathered parameters were organized into six groups, reflecting six identified XML schema
needs, i.e. XML schema quality demands, and meeting the three main XML schema demands: (1) good
structure, (2) consistent contents, (3) compliance with the domain. All parameters contributing to XML
schema quality and all aspects of quality are combined in Fig. 2.
Fig. 2 Quality aspects in XML schemas</p>
          <p>
Fig. 3 Quality-complexity dependence
2. METRIC TYPES
So that individual metrics could be compared, a normalization of parameters was conducted. All
parameters used within the metrics, and their results, were transformed to a scale from 0 to 1,
where 0 represents the worst value of a parameter and 1 the best. The transformation is a linear
mapping, assuming that the growth relationship is linear. The following metrics address all
aspects of XML schema quality.
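A minimal sketch of such a normalization (the clamping of out-of-range values and the degenerate-range guard are our assumptions; the paper only states that the mapping is linear, with 0 as the worst and 1 as the best value):

```python
def normalize(value, worst, best):
    """Linearly map a raw parameter value to [0, 1] (0 = worst, 1 = best).

    Works in both directions: if lower raw values are better, pass
    worst > best and the mapping is inverted automatically.
    """
    if best == worst:
        return 1.0  # degenerate range: nothing to distinguish
    scaled = (value - worst) / (best - worst)
    return max(0.0, min(1.0, scaled))  # clamp out-of-range raw values
```

For example, with `worst=0` and `best=10`, a raw value of 5 maps to 0.5; with `worst=10` and `best=0` (lower is better), a raw value of 3 maps to 0.7.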
2.1 Structural aspect
Other authors have researched measuring the structure of XML schemes for calculating the complexity
and quality by McDowell and others [
            <xref ref-type="bibr" rid="ref6">Burris 2012</xref>
            ]. The authors present a number of metrics, taken
mainly from "quality model" ISO standard and link them into a single formula. Each variable is further
multiplied, however the factors are not justified, values are not normalized, so the formula cannot be
applied, but we have analysed and partly used in our calculation formula of quality.
          </p>
          <p>Within the complexity calculations we can conclude that the higher the value of an individual parameter, the
greater the complexity (the relationship is shown in Fig. 3). According to XML schema needs we redefined the
metrics into the following composite metric (1) with these parameters:
- S1 - ratio between simple and complex data types
- S2 - ratio between annotations and the number of elements
- S3 - average number of restrictions per simple type declaration
- S4 - percentage of derived type declarations among all complex type declarations
- S5 - diversification of elements, or 'fanning', which influences the complexity of XML
schemas, suggesting inconsistencies in XML schemas that unnecessarily increase complexity
2.2 Transparency and documentation of the XML schema
The importance of a well documented and easy-to-read/understand XML schema is addressed through the
following relationship: the number of annotations (NAn) relative to the number of elements (NE) and attributes
(NAt) illustrates the documentation of an XML schema, supposing that more information about the building
blocks increases the quality. The parameters in metric (2) regard transparency and documentation.
Metric1 = (S1 + S2 + S3 + S4 + S5) / 5 (1)
Metric2 = NAn / (NE + NAt) (2)
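Read as the ratio of annotations to declared elements and attributes, metric (2) can be sketched as follows (the clamping to 1 and the zero-denominator guard are our additions, needed to stay on the normalized 0-1 scale):

```python
def metric2(n_annotations, n_elements, n_attributes):
    # Transparency/documentation: annotations per declared building block
    # (element or attribute), clamped to the normalized [0, 1] range.
    building_blocks = n_elements + n_attributes
    if building_blocks == 0:
        return 0.0  # nothing declared, nothing documented
    return min(1.0, n_annotations / building_blocks)
```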
2.3 XML schema optimality
In metric 3 we combined several parameters, indicating the optimal structure of an XML Schema. The
metric evaluates whether the in-lining pattern has been used, the least preferable one in XML schema
building. In doing so, we focus on the following relationships:
- (O1) The ratio between local elements and all elements
- (O2) The ratio between local attributes and all attributes
- (O3) The ratio between global complex elements and all complex elements
- (O4) The ratio between global simple elements and all simple elements.</p>
          <p>The ratios between XML schema building blocks (O1, O2, and O4) should be minimized, meaning
fewer local elements and attributes and more global simple and complex types; the number of
global elements (O3) should be as low as possible, due to the problem of several roots (such flexibility is
not always appreciated). This particular parameter divides domains into two groups (flexible
ones, appropriate for validating multiple different XML schemas, and strict ones, striving for a one-root
policy for validity or other reasons). In metric (3) we assumed that the majority of XML schemas want a certain
level of flexibility, therefore the aspect of security was disregarded.</p>
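<p>Metric (3) averages the four parameters, inverting O3; a direct transcription, assuming O1-O4 are already normalized to [0, 1]:

```python
def metric3(o1, o2, o3, o4):
    # Optimality metric (3): O3 enters inverted, because a low share of
    # global root elements is preferred (the "several roots" problem).
    return (o1 + o2 + (1 - o3) + o4) / 4
```
</p>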
          <p>Metric3 = (O1 + O2 + (1 − O3) + O4) / 4 (3)</p>
          <p>
The metrics described in the following subchapters use a similar set of parameters:
(NE) Number of elements
(NAt) Number of attributes
(NAn) Number of annotations
(LOC) Number of lines of code
(Nre_all) Number of references to elements (simple and complex)
(Nra_all) Number of references to attributes
(Nrg_all) Number of references to groups (of elements and attributes)
(Nri_all) Number of included and imported schemas
(Ng) Number of groups
2.4 XML schema minimalism
In this metric we combine the parameters that indicate the minimal XML schema building blocks,
where minimalism is defined as the level at which one can anticipate that there is no other,
smaller set of building blocks that is still fully descriptive:
2.5 XML schema re-use
The equation was inspired by [
            <xref ref-type="bibr" rid="ref8">Washizaki, Fukazawa 2005</xref>
            ], from which we summarized and defined a
set of metrics for measuring the re-use of software. The metric includes parameters that allow
reuse and are inherently global. We included the following parameters:
          </p>
          <p>
2.6 XML schema integrability
The definition of the equation was taken from the idea of the density of software components [
            <xref ref-type="bibr" rid="ref7">Narasimhan 2007</xref>
            ],
where the authors calculate the density of other segments of the software and the density of
interactions between them (lines of code, operations, classes, modules ...). We adjusted and simplified the
formula into our equation.
3. METRICS APPLICATION
We tested the proposed metrics on a set of 200 XML schemas drawn from different domains and
acknowledging several standards available on the market in each domain. Each XML schema was
evaluated manually and automatically with the proposed metrics, eliminating possible duplicates due to
the crossing of different fields. The results of all metrics were combined and mapped to a scale from 1 to 3,
where a level 1 schema is of high quality and a level 3 XML schema is of low quality (an identical scale was used
for the manual evaluation). Comparing the two types of evaluation, 83% of the data received an equal
evaluation (Fig. 4).
          </p>
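<p>The mapping of combined metric results onto the 1-3 quality scale can be sketched as follows; the concrete thresholds are hypothetical, since the paper does not state them:

```python
def quality_level(metric_values, n_metrics=6):
    # The six equally weighted metrics are summed (equation (7)); a higher
    # sum means better quality. The even thirds used as thresholds between
    # levels 1 (high) .. 3 (low) are an illustrative assumption.
    total = sum(metric_values)
    share = total / n_metrics  # back to the normalized [0, 1] range
    if share >= 2 / 3:
        return 1
    if share >= 1 / 3:
        return 2
    return 3
```
</p>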
          <p>Fig. 4 Manually estimated quality compared with quality measured by the metrics</p>
          <p>All metrics were considered equal; therefore no priority weights were applied to any metric. This
limitation was adopted to simplify our early-stage metric framework; weights were also omitted for
reasons of length, since the paper does not include a clarification of domain/aspect priorities. We treated all
aspects of XML schema quality as equal due to the heterogeneous domains, which were not explored in this paper.
The definition of weights will be part of our future work. For the purposes of this paper, we used the
following equation:
Metric = Metric1 + Metric2 + Metric3 + Metric4 + Metric5 + Metric6
(7)</p>
          <p>A presentation of the metrics' application is shown in Fig. 5. A set of 220 real-life standard or
semi-standard XML schemas was used to apply the defined metrics. The evaluation software produced a resulting
XML document with a summary of all data, any warnings or errors, and the metric results.</p>
          <p>Fig. 5 Metric application example based on an XML schema.
4. DISCUSSION
The focus of this paper was the definition of a full set of parameters for assessing the quality of XML schemas,
trying to include all aspects and needs of XML schema quality. We defined six metrics focusing on
important aspects of XML schema quality, and repositioned XML schema facts into parameters
measuring the importance of each building block. To verify correctness, we evaluated each XML schema
manually, based on a simple overview noting clarity and readability, and compared our results with the
metrics' results. The overlap was 83%.</p>
          <p>Correct (and quick) measurement of XML schema quality supports strategic decision-making and
improvements in data organization, serving as a standard mechanism (internal or global) for evaluating XML
schema quality. Software metrics are a good basis for measuring XML schema quality; however, some
accommodations are necessary according to the needs and demands of XML schemas. As users operate with different
data from multiple domains of XML technology application, the quality measurements vary depending
on the flexibility (or inflexibility) of the structures.</p>
          <p>In future work we will further explore the applicability of the defined metrics, their success and validity on
practical examples, and the need to adapt the metrics to the domain in which an XML
schema is used.
SSQSA Ontology Metrics Front-End
MILOŠ SAVIĆ, ZORAN BUDIMAC, GORDANA RAKIĆ AND MIRJANA IVANOVIĆ, University of Novi Sad
MARJAN HERIČKO, University of Maribor
SSQSA is a set of language-independent tools whose main purpose is to analyze the source code of software systems in order to evaluate
their quality attributes. The aim of this paper is to present how a formal language that is not a programming language can be
integrated into the front-end of the SSQSA framework. Namely, it is explained how the SSQSA front-end is extended to support
OWL2, a domain-specific language for the description of ontological systems. This extension of the SSQSA front-end
represents a step towards the realization of an SSQSA back-end that will be able to compute a hybrid set of metrics reflecting
different aspects of the complexity of ontological descriptions.</p>
          <p>Categories and Subject Descriptors: D.2.8 [Software Engineering]: Metrics – Complexity measures; I.2.4 [Artificial
Intelligence]: Knowledge Representation Formalisms and Methods – Representation languages
General Terms: Languages, Measurement
Additional Key Words and Phrases: OWL2, Ontology metrics, Complexity, SSQSA, eCST representation
1. INTRODUCTION
With the rise of the semantic web, ontologies have become a key technology to provide formal description
of shared and reusable knowledge. Viewed as “explicit specification of conceptualization” [Gruber 1993],
ontologies are used to define concepts and relations present in a domain in order to support reasoning,
integration, and aggregation of data by autonomous software agents. Since real-world ontologies rapidly
increase in size, it has become highly important to measure, evaluate and understand their complexity, in
order to be able to control their maintenance and evolution.</p>
          <p>SSQSA is a set of language-independent tools that statically analyze software systems in order to
evaluate their quality attributes [Budimac et al. 2012]. The whole framework is organized around the
enriched Concrete Syntax Tree (eCST) representation of source code [Rakić and Budimac 2011b]. The
motivation for this work was to explore the possibility of using the eCST representation to compute metrics
which reflect the complexity of ontological descriptions. In order to obtain the eCST representation of
ontology, the SSQSA front-end has to be extended to support a language for the description of ontological
systems. The aim of this paper is to explain how the SSQSA front-end is extended to support OWL2
language in functional-style syntax.</p>
          <p>The rest of the paper is structured as follows. The next section presents related work. Section 3
covers the integration of OWL2 into the SSQSA framework. Section 4 discusses the benefits
of the eCST representation of ontologies. The last section concludes the paper and gives directions for
future work.</p>
          <p>This work was partially supported by the Serbian Ministry of Education, Science and Technological Development through project
Intelligent Techniques and Their Integration into Wide-Spectrum Decision Support, no. OI174023. The authors also would like to
thank Rok Žontar for fruitful discussions on ontology metrics.</p>
          <p>Author's address: M. Savić, Z. Budimac, G. Rakić, M. Ivanović, Department of Mathematics and Informatics, Faculty of Sciences,
University of Novi Sad, Trg Dositeja Obradovića 4, 21000 Novi Sad, Serbia, email: {svc, zjb, goca, mira}@dmi.uns.ac.rs; M. Heričko,
Institute of Informatics, Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova ulica 17, 2000
Maribor, Slovenia, email: marjan.hericko@uni-mb.si.
2. RELATED WORK
2.1 Ontology metrics
In recent years, various metrics for measuring the complexity of ontological descriptions have been proposed.
Inspired by the Chidamber and Kemerer [1994] metrics suite, Yao et al. [2005] proposed three cohesion
metrics defined on a graph that represents subsumption dependencies between ontological
concepts. Orme et al. [2006] introduced three coupling metrics defined on the graph
representation of an ontology. Tartir et al. [2005] introduced the OntoQA metric suite, which contains 12 structural
metrics, also defined on the ontological graph. Zhang et al. [2010] likewise proposed several new graph-based
structural metrics for ontology evaluation. Their metrics suite, among others, contains metrics adopted
from the Chidamber-Kemerer suite (NOC, DIT, CBO). Žontar and Heričko [2012] analyzed software
metrics from the Lorenz-Kidd, Chidamber-Kemerer and Abreu metric suites in order to determine which
of them can be adopted for ontologies. The results of their study show that graph-based software metrics
can be adopted for ontology evaluation.
2.2 SSQSA Framework
The SSQSA framework consists of two parts: the SSQSA front-end, also known as the eCST Generator, and a
set of SSQSA back-ends, individual tools that operate on the eCST representation of source code. The
main characteristic of the eCST representation is that it contains so-called universal nodes,
language-independent markers that denote the meaning of concrete language constructs. The architecture of
SSQSA is presented in Figure I, which also shows how the architecture is planned to be extended with a
new back-end in order to support the analysis and evaluation of ontological systems.</p>
          <p>SSQSA originated from the language-independent software metrics tool SMIILE [Rakić and Budimac
2011a]. SMIILE uses the eCST representation to calculate metrics reflecting the internal complexity of
software entities, such as LOC and cyclomatic complexity [McCabe 1976]. It is also integrated with
Testovid, a semi-automated assessment system for students’ programs, in order to provide metric-based
qualification of programming assignments [Pribela et al. 2012]. SSCA was the first SSQSA back-end
which extended the applicability of the eCST representation [Gerlec et al. 2012]. This tool tracks and
analyzes changes in the hierarchical structure of software entities and stores its results in a repository
that also contains metric values obtained using SMIILE. The most recently realized SSQSA back-end is SNEIPL
[Savić et al. 2012]. This tool extracts dependency networks formed by software entities that can be used to
analyze the design complexity of software systems under the framework of complex network theory.
The obtained networks can also be viewed as fact bases required for reverse engineering activities and used to
calculate metrics related to software design.</p>
          <p>Currently SSQSA supports six general-purpose, imperative programming languages: Java, C#, Delphi,
Modula-2, Pascal and Cobol. This work is therefore the first attempt to extend the SSQSA front-end to
produce the eCST representation of a declarative, domain-specific language.
3. INTEGRATION OF THE OWL2 LANGUAGE INTO THE SSQSA FRONT-END
The eCST Generator uses parsers generated by the ANTLR [Parr and Quong 1995] parser generator to
produce the eCST representation of the source code provided as input. The advantage of using ANTLR
to describe the languages supported by SSQSA is the ANTLR grammar notation itself. This notation enables
the modification of syntax trees through tree rewrite rules attached to grammar productions.
Therefore, in order to integrate OWL2 into the SSQSA front-end, the following steps have to be made:
1. Realization of an ANTLR grammar which describes OWL2 FSS,
2. Identification of OWL2 language constructs that correspond to existing eCST universal nodes,
3. Incorporation of eCST universal nodes into the tree rewrite rules of the grammar in order to obtain the
eCST representation of the parsed text.
3.1 Step 1 – ANTLR grammar for OWL2 FSS
The formal specification of OWL2 FSS in Extended Backus-Naur Form (EBNF) can be found in the official
W3C OWL2 language specification [Motik et al. 2012]. The ANTLR grammar notation closely follows
EBNF, thus the grammar in [Motik et al. 2012] can easily be adapted for ANTLR. At this stage of the
integration, the realized grammar was tested using ten ontologies from the TONES2 repository which were
previously converted into OWL2 FSS using Protégé3. The results are summarized in Table I. It can be
seen that the parser generated from the grammar successfully parsed more than 1.4 million lines of
real-world ontological axioms in less than three minutes.
Table I. Parse results for the test ontologies, among them FMA (Foundational Model of Anatomy), GEO and
Skills: ontology sizes between 20506 and 476111 lines of axioms, parse times between 2 and 59 seconds.</p>
          <p>
3.2 Step 2 – Universal nodes
The OWL2 FSS language contains four types of tokens: keywords, separators, identifiers and constants. For
each of these lexical categories, eCST universal nodes have already been introduced. Ontological
axioms are marked with the STMT universal node, which is used to mark individual statements in imperative
2 http://owl.cs.manchester.ac.uk/repository/
3 http://protege.stanford.edu/
programming languages. Elements of an axiom are also marked with existing universal nodes (TYPE,
ARGUMENT_LIST, and ARGUMENT). The PACKAGE_DECL universal node denotes that entities
declared in an eCST sub-tree rooted at this node are mutually visible. Therefore, PACKAGE_DECL
corresponds to the declaration of ontology. Declarations of ontological entities (concepts, roles and
individuals) are marked with ATTRIBUTE_DECL universal node which is used to denote declarations of
global variables in imperative programming languages. Ontological expressions that can be nested (class
and data range expressions) are marked with the EXPR universal node.</p>
          <p>OWL2 is a declarative, domain-specific language. Before the integration of OWL2, SSQSA supported
several programming languages none of them being declarative or domain-specific. OWL2 axioms
represent explicitly stated relations among ontological entities. Therefore, we introduced three new
universal nodes that denote different categories of explicitly stated relations in general:
1. BINARY_RELATION (BR) marks binary relations
2. SYMMETRIC_RELATION (SR) marks symmetric n-ary relations
3. PARTIALLY_KNOWN_BINARY_RELATION (PKBR) marks binary relations in which one of the
arguments is not known at the moment.</p>
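<p>The categorization of axioms into the three relation nodes, as described in the surrounding text, can be sketched as a lookup table (the axiom names follow the OWL2 functional-style syntax; the fallback to STMT for uncategorized axioms is our assumption):

```python
# Representative OWL2 axiom types mapped to the three new universal nodes.
AXIOM_CATEGORY = {
    # subsumptions and assertions
    "SubClassOf": "BINARY_RELATION",
    "SubObjectPropertyOf": "BINARY_RELATION",
    "ClassAssertion": "BINARY_RELATION",
    # symmetric n-ary relations
    "EquivalentClasses": "SYMMETRIC_RELATION",
    "DisjointClasses": "SYMMETRIC_RELATION",
    "SameIndividual": "SYMMETRIC_RELATION",
    "DifferentIndividuals": "SYMMETRIC_RELATION",
    "EquivalentObjectProperties": "SYMMETRIC_RELATION",
    "DisjointObjectProperties": "SYMMETRIC_RELATION",
    # binary relations with one argument not known at the moment
    "ObjectPropertyDomain": "PARTIALLY_KNOWN_BINARY_RELATION",
    "ObjectPropertyRange": "PARTIALLY_KNOWN_BINARY_RELATION",
}

def universal_node_for(axiom_name):
    # Uncategorized axioms fall back to the generic statement marker.
    return AXIOM_CATEGORY.get(axiom_name, "STMT")
```
</p>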
          <p>All OWL2 relations that denote subsumptions and assertions are marked with BINARY_RELATION. The
SYMMETRIC_RELATION universal node is associated with relations indicating the equivalent and
disjoints classes, same and different individuals, and equivalent and disjoint object properties. The
PARTIALLY_KNOWN_BINARY_RELATION universal node marks object property domain and object
property range relations. The newly introduced universal nodes are currently used only in the eCST
representation of ontological descriptions. However, they can be used to mark explicitly stated binary and
symmetric relations in other descriptive languages as well. Explicitly stated relations among entities in
already supported imperative programming languages are marked with specific, more concrete universal
nodes, such as EXTENDS and IMPLEMENTS. Those universal nodes can be viewed as sub-concepts of
the BINARY_RELATION universal node.
3.3 Step 3 – Tree-rewrite rules
Once the correspondence between the constructs of a concrete language and eCST universal nodes is
identified, it is straightforward to incorporate universal nodes into the tree rewrite rules of the
grammar. For example, it has been identified that ontology declarations correspond to the
PACKAGE_DECL universal node. Therefore, the PACKAGE_DECL universal node is incorporated in
the tree rewrite rule of the production that describes the ontology declaration, as the following excerpt from the
OWL2 FSS grammar shows:
ontology : 'Ontology' '(' (ontologyIRI versionIRI?)? importo* annotation* axiom* ')'
-&gt; ^(PACKAGE_DECL
^(KEYWORD 'Ontology')
^(SEPARATOR '(')
(ontologyIRI versionIRI?)?
importo* annotation* axiom*
^(SEPARATOR ')')
);
Besides the PACKAGE_DECL universal node, two other universal nodes are also incorporated in the rule:
KEYWORD and SEPARATOR to mark keywords and separators in ontology declaration, respectively.</p>
          <p>Figure II shows how a simple ontology named “PL” looks in the eCST representation. The complete
description of the ontology in the functional-style syntax is as follows:</p>
          <p>Ontology (:PL</p>
          <p>SubClassOf(:C :CPP)
)
The SubClassOf axiom states that each program written in the programming language C is at the same
time a valid C++ program.
4. BENEFITS OF OWL2 INTEGRATION INTO SSQSA
Metrics that reflect complexity of a description written in a programming or formal language can be
classified as follows:
1. Metrics of internal complexity reflect lexical and syntactical complexity of the description or some
of its parts. Lexical complexity measures are derived from the lexical elements of a language and
reflect the complexity that is related to the volume of the description. Representative metrics
which belong to this category are LOC family of metrics and Halstead [1977] complexity
measures. Syntactical complexity is related to the compositional (structural) complexity of
concrete language constructs. Cyclomatic complexity is an example of widely used measure of
syntactical complexity.
2. Metrics of design complexity reflect the complexity of dependency structures among identifiers
introduced in the description. Those metrics quantify inheritance, coupling and cohesion
relationships among entities represented by the identifiers. Representative examples are CBO,
NOC, DIT and LCOM metrics from the Chidamber-Kemerer metrics suite.
3. Hybrid metrics combine metrics of internal and design complexity. Examples are WMC and RFC
from the Chidamber-Kemerer metrics suite, and the Henry-Kafura complexity [Henry and Kafura
1981].</p>
          <p>As can be seen from the review of related work on ontology metrics, the complexity of an ontological
description is viewed as some measure of the complexity of an underlying graph representation. In other words,
the ontology metrics introduced so far belong to the category of design complexity metrics. The integration
of OWL2 into the SSQSA front-end provides the eCST representation of an ontology. This representation can
be used to define (or adapt) and compute metrics of internal complexity, which is not possible in the
graph-based representation of an ontology. For example, Halstead complexity metrics adapted for ontologies
can be calculated in the same way as for software systems: by counting eCST universal nodes
representing lexical categories. Similarly, the statement and expression level universal nodes can be used
to derive syntactical complexity measures. Currently, it is possible to use the SMIILE back-end to obtain
LOC and Halstead metrics for ontological descriptions. SMIILE also calculates cyclomatic complexity (CC)
for software systems, but this metric cannot be adopted for ontology evaluation, since there are no OWL2
language elements that correspond to branch and loop statements. However, the predicate counting
procedure used for the computation of the CC metric in SMIILE can be adapted to derive the complexity
of nested OWL2 class and data range expressions (by counting EXPR universal nodes in the eCST).</p>
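<p>The counting of lexical-category universal nodes for Halstead measures can be sketched as follows. The tuple-based eCST encoding is hypothetical and serves only to illustrate the counting procedure; the tree mirrors the "PL" ontology example, treating keywords and separators as operators and identifiers and constants as operands:

```python
from collections import Counter
import math

# A minimal, hypothetical eCST: (universal_node, children) pairs, where
# lexical-category leaves carry the concrete token as their only child.
ECST = ("PACKAGE_DECL", [
    ("KEYWORD", ["Ontology"]),
    ("SEPARATOR", ["("]),
    ("STMT", [
        ("KEYWORD", ["SubClassOf"]),
        ("SEPARATOR", ["("]),
        ("IDENTIFIER", [":C"]),
        ("IDENTIFIER", [":CPP"]),
        ("SEPARATOR", [")"]),
    ]),
    ("SEPARATOR", [")"]),
])

def collect_tokens(node, operators, operands):
    label, children = node
    if label in ("KEYWORD", "SEPARATOR"):
        operators[children[0]] += 1
    elif label in ("IDENTIFIER", "CONSTANT"):
        operands[children[0]] += 1
    else:
        for child in children:
            collect_tokens(child, operators, operands)

def halstead_volume(tree):
    operators, operands = Counter(), Counter()
    collect_tokens(tree, operators, operands)
    vocabulary = len(operators) + len(operands)                # n1 + n2
    length = sum(operators.values()) + sum(operands.values())  # N1 + N2
    return length * math.log2(vocabulary)
```
</p>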
          <p>The PACKAGE_DECL and ATTRIBUTE_DECL universal nodes can be used to recognize the declarations of ontologies and the
declarations of ontological concepts, roles and named individuals in a represented ontological description.
Relations among those entities can be identified by the analysis of eCST sub-trees rooted at BR, SR and
PKBR universal nodes (see Section 3.2). This means that the graph representation of ontology can be
extracted from the eCST representation of the ontology. Therefore, an ontology metrics tool based on
the eCST representation will also be able to compute metrics of design complexity. Finally, metrics of
internal and metrics of design complexity can be combined to obtain hybrid complexity metrics. The
extraction of the graph representation of an ontology is a fundamentally different problem than the extraction
of software networks, due to the structural difference between ontological and software entities. The
hierarchy tree representation of an ontological description can be obtained using the SNEIPL back-end,
but SNEIPL cannot be used to identify horizontal dependencies (dependencies between entities of the
same type) among ontological concepts and individuals. These ontological entities are structurally atomic,
i.e. they are not composed of other ontological entities. By contrast, software entities (classes,
functions, etc.) are not structurally atomic: the definition of a software entity A associates the name of A
with a body that contains the structure of A. Horizontal dependencies between a software entity A and
other entities are contained in the body of A, while horizontal dependencies between ontological entities
are independent of ontological declarations. Since the SNEIPL back-end cannot be used to obtain the
graph representation of ontology, a new SSQSA back-end that computes graph-based ontological metrics
will be developed (see Figure I). This back-end will reuse and adapt modules from SMIILE to compute
metrics of internal complexity, as well as modules from SNEIPL to form the hierarchy tree representation,
which is the first step in the extraction of the ontological graph (identification of ontological entities and
vertical dependencies).</p>
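          <p>The extraction step described above can be sketched in a few lines. The record format below (declared entity names plus relation pairs recovered from BR/SR-style subtrees) is an assumption for illustration only, not the actual eCST structure:
```python
# Hypothetical sketch: entity declarations (from ATTRIBUTE_DECL nodes) become
# graph nodes, and relation subtrees (BR/SR-style) become edges.
# All names and the record format are illustrative assumptions.

declarations = ["Person", "Student", "Course"]   # declared concepts
relations = [("Student", "Person"),              # e.g. a SubClassOf axiom
             ("Student", "Course")]              # e.g. an object property assertion

def build_ontology_graph(decls, rels):
    nodes = set(decls)
    # keep only edges whose endpoints were actually declared
    edges = [(a, b) for a, b in rels if a in nodes and b in nodes]
    return nodes, edges

nodes, edges = build_ontology_graph(declarations, relations)
print(len(nodes), len(edges))  # 3 2
```
Graph-based design metrics (fan-in/fan-out, coupling and the like) could then be computed over the resulting node and edge sets.</p>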
          <p>
The SSCA back-end constructs and compares hierarchy trees of two consecutive versions of a software
system in order to determine changes in vertical dependencies (dependencies among entities at different
levels of abstraction). The hierarchy tree representation of an ontological description can be obtained from
the eCST representation in the same way as for software systems: it is entirely determined by the
hierarchical structure of eCST universal nodes in the concrete eCSTs. This means that this back-end can be
applied to ontologies in order to identify which concepts and named individuals are added or removed in
the next version of an ontology, and to what extent. Finally, with the design and development of new SSQSA
back-ends, it will be investigated whether they can be applied to analyze both software and ontological
systems.
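The core of this version comparison can be sketched as a simple set difference over the entities found in the hierarchy trees of two consecutive versions; the entity names below are illustrative only:
```python
# Minimal sketch of the SSCA-style comparison: given the sets of concepts and
# named individuals extracted from two consecutive ontology versions, report
# what was added and what was removed. Names are invented for illustration.

def diff_versions(old_entities, new_entities):
    added = sorted(new_entities - old_entities)
    removed = sorted(old_entities - new_entities)
    return added, removed

v1 = {"Person", "Student", "Course"}
v2 = {"Person", "Student", "Teacher"}
added, removed = diff_versions(v1, v2)
print(added, removed)  # ['Teacher'] ['Course']
```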
5. CONCLUSION AND FUTURE WORK
In this paper we described how the SSQSA front-end was extended to support the OWL2 language in functional-style
syntax. It was also shown that the eCST representation of ontologies can be used to compute metrics
that reflect both internal and design complexity of ontological descriptions. Therefore, our future work
will include the development of a SSQSA back-end, as shown in Figure I, that uses the eCST
representation of ontology to compute metrics reflecting different aspects of complexity of ontological
descriptions. In our future work we will also investigate whether recently introduced metrics of cognitive
complexity of programs written in object-oriented languages [Misra et al. 2012] and metrics of complexity
of web service descriptions [Basci and Misra 2012] can be adapted and used for ontology evaluation.
Dilek Basci and
            <xref ref-type="bibr" rid="ref10">Sanjay Misra. 2012</xref>
            . Metric suite for maintainability of eXtensible Markup Language web services. IET Softw. 5, 3,
320-341.
          </p>
          <p>
            Zoran Budimac, Gordana Rakić, and
            <xref ref-type="bibr" rid="ref17">Miloš Savić. 2012</xref>
            . SSQSA architecture. In Proceedings of the Fifth Balkan Conference in
          </p>
          <p>Informatics (BCI '12). ACM Conf. Proc. 1479, 287-290.</p>
          <p>Shyam R. Chidamber and Chris F. Kemerer. 1994. A Metrics Suite for Object Oriented Design. IEEE Trans. Softw. Eng. 20, 6,
476-493.
Črt Gerlec, Gordana Rakić, Zoran Budimac, and Marjan Heričko. 2012. A programming language independent framework for
metrics-based software evolution and analysis. Computer Science and Information Systems 9, 3, 1155-1186.
Thomas R. Gruber. 1993. A translation approach to portable ontology specifications. Knowl. Acquis. 5, 2, 199-220.
Maurice H. Halstead. 1977. Elements of software science. Elsevier North-Holland, Amsterdam.</p>
          <p>Sallie M. Henry and Dennis G. Kafura. 1981. Software structure metrics based on information flow. IEEE Computer Society Trans.</p>
          <p>Software Engineering 7, 5, 510-518.
</p>
          <p>Mobile Device and Technology Characteristics’ Impact on
Mobile Application Testing
TINA SCHWEIGHOFER AND MARJAN HERIČKO, University of Maribor
Mobile technologies have a significant impact on processes in ICT, including software development. Within mobile technologies a
new type of software has emerged: mobile applications. Nowadays the concept of mobile applications is widely known, and
their development is increasingly widespread. One of the most important parts of mobile application development is mobile
application testing. Testing has always been a crucial part of the software development cycle, and an appropriate
testing procedure significantly increases the quality of the developed product. With mobile application testing, new
challenges associated with mobile technologies and device characteristics have arisen, for example: connectivity,
convenience, touch screen technology, context awareness and the range of supported devices. It is important that we
adequately address these challenges and perform an appropriate mobile application testing process, resulting in a
high-quality product without critical defects that could cause quality issues or the unwanted waste of human or
financial resources. In this paper, we present a mobile application testing process, indicate its important parts and
especially emphasize the challenges related to the features and properties of mobile devices and technologies.</p>
          <p>General Terms: Mobile applications testing
Additional Key Words and Phrases: testing, mobile applications, mobile technologies, quality
1. INTRODUCTION
Mobile devices and mobile applications play an important role in our everyday lives. Nowadays we are
surrounded by mobile technology and cannot imagine running personal or business errands without them.
This has been confirmed by numerous pieces of research. According to Gartner, worldwide sales of
mobile phones in the third quarter of 2012 reached almost 428 million units. Within this number,
smartphone sales represent almost 40 percent of total mobile phone sales [Gartner 2012]. A similar thing
is happening in the area of mobile subscriptions. At the end of 2012, there were approximately 6.8 billion
mobile subscribers in the world, which corresponds to a global mobile-cellular penetration rate of 96
percent. In Europe the rate is even higher, at 126 percent [ITU
2013].</p>
          <p>Closely related to mobile devices are mobile applications. By the end of 2012, there were approximately
1.1 billion mobile application users. According to forecasts, the number will grow rapidly, by nearly 30
percent per annum, to reach 4.4 billion by the end of 2017 [Whitfield 2013a]. Applications generated $12
billion in revenue in 2012 and a total of 46 billion applications were downloaded [Portio Research 2012].
This number is also expected to grow: in 2013 smartphone and tablet users will download a further 82
billion applications [Whitfield 2013b]. Mobile applications are currently represented in almost every
possible personal or business domain. Although games still constitute the largest category in most of the
major application stores [Whitfield 2013b], mobile applications can be seen in just about every industry.
Some examples include: retail, media, travel, education, healthcare, finance, social, business applications,
collaboration and more [uTest 2012].</p>
          <p>Some of these applications within a specific domain use more or less sensitive user data. Users frequently allow access to personal data in the
context of mobile devices and also enter a lot of personal information. In this context, the issue of users’
trust takes on an important role. It becomes important to provide quality mobile applications that are
reliable and flawless [Hu and Neamtiu 2011]. Applications that are reliable and work flawlessly within
expected functionalities can gain a user’s trust and, more importantly, keep it. Users also often have high
expectations about the quality of mobile applications. Applications that crash and lose users’ personal
data are not acceptable [Bo et al. 2007]. One of the most important mechanisms for providing reliable, flawless
and quality mobile applications is an appropriate testing procedure. Testing during mobile application
development is slightly different from the testing procedures of traditional software, and the process itself is
adapted to the area of mobile applications and mobile technologies.</p>
          <p>Author's address: T. Schweighofer, Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova 17,
2000 Maribor, Slovenia; email: tina.schweighofer@uni-mb.si; M. Heričko, Faculty of Electrical Engineering and Computer Science,
University of Maribor, Smetanova 17, 2000 Maribor, Slovenia; email: marjan.hericko@uni-mb.si.</p>
          <p>In this paper we will present a testing procedure for testing mobile applications. We will identify and
describe specific characteristics for mobile devices, mobile applications and mobile technologies as a
whole, which have a significant impact on the testing procedure. First, in Section 2, we will present the
fundamentals of software testing and reveal some of the major differences between testing traditional and
mobile software. We will also provide an introduction to mobile application testing. In Section 3, we will
present some of the specific characteristics of mobile technologies that have an impact on testing and
challenges in testing mobile applications. This will be grounded in a practical approach to
mobile application testing procedures and the experience we have gained. In the Discussion, we will present the
findings and results of our work.
2. FUNDAMENTALS OF MOBILE APPLICATION TESTING
Mobile application development has specific characteristics that need to be addressed through the entire
product’s life cycle. According to a recent study [Wasserman 2010], there are important software
engineering research issues linked to mobile application development. Some of these issues include:
potential interaction with other applications, handling available sensors, the development of native or
hybrid mobile applications, different families of hardware and software mobile platforms, problems of
security, an adjusted user interface and the problem of power consumption.</p>
          <p>Testing process plays an important role in the life cycle of a software product, whether in mobile or
traditional desktop application. Therefore, it is crucial to address abovementioned issues in related mobile
testing procedures.</p>
          <p>A lot of research has dealt with the fundamentals of software testing; therefore, there are many
available definitions of testing. To summarize one of them: testing is an activity performed for
the purpose of evaluating product quality, and for improving the product by identifying potential defects
and problems. Software testing is composed of the dynamic verification of the program behavior on a
finite set of test cases against the expected program behavior [Bourque and Dupuis 2004].</p>
          <p>Testing is not just an activity that starts after the coding phase is finished and is used to detect
failures. Software testing is a procedure that should be active through the entire product life cycle, from
the development and maintenance process to actual product construction. Also, the planning phase for
testing should occur early in the product requirements process and test plans must be systematically and
continuously developed, as the development of a product proceeds. Currently it is considered that the
right strategy for quality is one of prevention. It is much better to avoid problems than to correct them.
Therefore, testing must be viewed as a procedure for checking if prevention was successful and for
identifying faults in cases where prevention was not effective [Bourque and Dupuis 2004].</p>
          <p>An important aspect that makes mobile testing different is the complexity of testing, a point made by
the authors of the aforementioned study [Wasserman 2010]. One challenge that they mention is the
diversity of available mobile devices, for example Android devices, which is especially relevant when testing
native mobile applications. There are also many other challenges related to mobile application testing. We
will describe these challenges in detail in the subsection below.
2.1 Mobile Application as a Testing Object</p>
          <p>If we want to properly understand the concept of mobile application testing, it is important that we
understand what a mobile application is. We are all familiar with mobile applications, but what does the
definition say? A mobile application is a type of software application designed to run on smart phones,
tablets and other mobile devices. Similarly, a mobile application in the context of mobile computing is an
application that runs on an electronic device that may move
[Kirubakaran and Karthikeyani 2013].</p>
          <p>The testing of mobile applications is an important and also very difficult task, according to various
authors [Bo et al. 2007; She et al. 2009; Kirubakaran and Karthikeyani 2013; Franke and Weise 2011].
They all believe that testing mobile applications is a non-trivial process that takes a lot of time, effort and
other resources. We have had the same experience with projects where we developed mobile applications
for Android, iOS and BlackBerry. The experience is described in detail below in Section 3. As previously
mentioned, as mobile applications become more and more complex and ubiquitous, users have higher and
higher expectations with regard to mobile application quality. Users want an application that does not
fail, lose data or harm the device’s operability, as well as applications that are secure, reliable and easy to
use. If we conduct the testing procedure properly, possible defects embedded in the application can be
detected and removed and this can lead to greater confidence in an application [Bo et al. 2007; She et al.
2009].</p>
          <p>The challenges encountered during mobile application testing were mostly related to the different
characteristics of mobile devices and mobile technologies, which have a direct influence on mobile
applications and the conducted testing procedure. The existing literature describes many of these
characteristics. As noted by [Kirubakaran and Karthikeyani 2013; Franke and Weise 2011],
these characteristics are: connectivity, convenience, user interface, supported devices, touch screens, new
programming languages, resource constraints, context awareness and data persistence. The mentioned
characteristics are presented in Figure 1.</p>
          <p>[Figure 1 depicts the nine characteristics: connectivity, supported devices, resource constraints, convenience, touch screen, context awareness, user interface, programming languages and data persistence.]</p>
          <p>Fig. 1. Characteristics of mobile devices and technologies with their impact on the testing procedure
3. CHALLENGES IN MOBILE APPLICATION TESTING
As previously mentioned, during mobile application testing we came across different challenges. Different
authors have already investigated some of the challenges that have a significant influence on the testing
procedure. We came across the same characteristics that consequently represent challenges in testing
mobile applications. As mentioned, we developed mobile applications for the operating systems Android,
iOS and BlackBerry in the context of a research and development project. The mobile applications are part
of a larger project, which also includes a web application. Within the development process, we also
performed mobile application testing. Application testing is a complex process, but for the needs of this
article we will show a simplified version, which can be seen in Figure 2. The process starts with the
release of a version of the mobile application for a specific platform for testing purposes. The Quality
Assurance team receives a version and starts the testing process based on the recorded test scenarios. If
they find an irregularity, an error or an unreliable function, they report the problem to the web-based bug
tracking system. Bugs are then reviewed and fixed by the development team. We should point out that
within our project we also performed different types of test cycles. The most common was the weekly
testing procedure; there is also testing for the purpose of the application’s release on the corresponding
application market.</p>
          <p>[Fig. 2. Simplified testing process: mobile application version release → QA receives mobile application → testing process based on test scenarios → report problem into bug tracking system → bug fixed by development team.]</p>
          <p>The most important part of the testing process is the execution of test scenarios, where the specific
characteristics of mobile devices are revealed. In fact, they also play an important part in writing test
scenarios, where we have to shape each test scenario so that it considers and verifies a specific
characteristic. When we started to write and later execute specific test scenarios, we reviewed the existing
literature from the area of mobile application testing. Specific characteristics identified in different works
were taken into account within our own testing procedure. The nature of these characteristics, what
existing literature says, and how we dealt with them is discussed below.</p>
          <p>The first property we came across, and one that has an impact on many different types of testing, is
connectivity. Mobile applications have to be designed with the awareness that they will always be online,
because mobile devices are always logged on to a mobile network. Networks can vary in speed, reliability
and security; slow and unreliable wireless networks in particular are a common obstacle for mobile
applications. This property has to be considered in functional testing, where different network and
connectivity scenarios have to be performed, with an emphasis on popular networks. Connectivity
also has an effect on performance, security and reliability testing [Kirubakaran and Karthikeyani 2013;
uTest 2012]. In practice, we address connectivity by testing our applications on different networks.
We also perform test scenarios that cover different internet connections: we use different Wi-Fi networks
and cellular networks from different operators and in different places, such as buildings, city centers or in
nature. For our application connectivity is very important, because the functions of the mobile applications
are supplemented by a web application, so the application uses the synchronization function very often.</p>
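          <p>The combinations described above can be enumerated mechanically. The following sketch is illustrative only (it is not the authors' actual tooling), showing how network types, operators and locations combine into concrete connectivity test scenarios:
```python
# Illustrative sketch: building a connectivity test matrix by combining
# network types, operators and locations into scenario labels.
# All concrete names are invented for illustration.

from itertools import product

networks = ["Wi-Fi", "3G", "EDGE"]
operators = ["operator A", "operator B"]
locations = ["building", "city center", "nature"]

scenarios = ["sync over {} ({}, {})".format(n, o, l)
             for n, o, l in product(networks, operators, locations)]
print(len(scenarios))  # 18
```
Even three values per dimension already yield 18 scenarios, which illustrates why connectivity testing is effort-intensive in practice.</p>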
          <p>Another important property according to other studies is the user interface, which is related to the
characteristic of convenience. This property is important because user interfaces in development need to
follow specific guidelines based on the different platforms for which they are being developed. Different
platforms have their own rules and guidelines about how a user interface should look, so if a
product is being developed for several platforms we have to focus strongly on platform-specific
design. Regardless of the platform, making the best possible use of limited screen space remains a big
challenge, so the design of the user interface takes on greater importance in the
development process. The user interface looks different depending on the mobile device’s screen resolution and
its dimensions. One implication for testing is the range of different devices that needs to be covered by the
testing procedure. It is recommended to test the user interface on as many different mobile devices as
possible, because different devices behave differently with the same application code [Hu and
Neamtiu 2011; Kirubakaran and Karthikeyani 2013; Wasserman 2010]. Within the development of
mobile applications in our project, the developers followed specific rules and good practices for designing
platform specific applications. These guidelines were also reviewed in the testing phase. We also
developed our own Style guide document, which ensured that regardless of the platform, the application
would look similar and reflect the fact that all applications are part of the same product family. With
regard to the testing process, we tested the appearance on different mobile phones, with different
resolution and different physical dimensions. We considered the minimal and optimal screen size, which
was set within the Software requirements specification document.</p>
          <p>Nowadays many different mobile devices are available. What is important is that applications work
flawlessly on as many devices as possible. Supported devices represent one of the most difficult aspects of
the testing process. Devices from different vendors have different software and hardware components. In
particular, there are hundreds of different mobile devices that run the Android operating system, and this
operating system itself exists in many different versions. Different versions of operating systems
are also a great challenge to cover within the testing process [Kirubakaran and Karthikeyani 2013].
Usually it is impossible to test every available device, so we group mobile devices in different categories,
as proposed in [Kirubakaran and Karthikeyani 2013]. The focus of this challenge is on Android mobile
devices. We tested our mobile applications on mobile devices from different vendors, with different
hardware components and different versions of operating systems. We developed three groups: small,
optimized and high quality mobile devices. The first group included mobile devices with a small screen
size and low resources, while the last group included mobile devices with a high screen resolution and a
lot of resources. Test scenarios were carried out on a few representatives of each group. However, iOS
devices were a different story as there is not such a large variety of different mobile devices. The same
testing strategy was used for testing the touch screens of mobile devices and their properties, which also
represent an important challenge in mobile application testing. Touch screens are the main tool for
inputting user data into a mobile application. An important aspect is the system response time to a touch,
which depends on device resource utilization and may easily become slow in some circumstances, such as
a busy processor, a lack of memory or other problems. Thus, it is important to test the touch
screen’s abilities under different circumstances [Kirubakaran and Karthikeyani 2013]. We tested touch
screen capabilities under different circumstances, as proposed: we burdened the processor and available
memory by running multiple applications simultaneously, in order to test the behavior of
different touch screens on different devices.</p>
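          <p>The grouping of test devices described above can be sketched as a simple classifier. The thresholds and group boundaries below are invented for illustration; the actual grouping criteria used in the project were not published in this form:
```python
# Hedged sketch: classifying Android test devices into the three groups
# described in the text ("small", "optimized", "high quality") by screen
# resolution and RAM. Thresholds are illustrative assumptions.

def classify(width_px, height_px, ram_mb):
    pixels = width_px * height_px
    if pixels < 480 * 800 or ram_mb < 512:
        return "small"
    if pixels >= 1080 * 1920 and ram_mb >= 2048:
        return "high quality"
    return "optimized"

print(classify(320, 480, 256))     # small
print(classify(720, 1280, 1024))   # optimized
print(classify(1080, 1920, 3072))  # high quality
```
Test scenarios would then be executed on a few representatives of each group, as the text describes.</p>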
          <p>As many authors agree, mobile devices are becoming more and more powerful, but their resources, like
processor power, RAM, and resolution are still facing restrictions [Kirubakaran and Karthikeyani 2013;
She et al. 2009; Franke and Weise 2011; Portio Research 2012]. This characteristic is closely linked to
some of the previously mentioned characteristics, like supported devices and touch screens. As proposed in
[Kirubakaran and Karthikeyani 2013] mobile device resources have to be continuously monitored, to see
what a specific mobile device is capable of and to verify what actions are taken if a device runs out of
resources. A very similar characteristic is data persistence: because mobile devices that run out of
memory shut down running applications, we have to make sure user data is stored and saved
adequately [Franke and Weise 2011]. We tested these two characteristics within the specified groups of
mobile devices: we tried to overload a specific mobile device and tested the behavior of the mobile
application, checking whether it stored data properly and, of course, where the breaking limit for the
mobile application is.</p>
          <p>A very important characteristic that has a significant impact on testing our mobile application is
context awareness. A lot of mobile applications also rely on sensed data, provided by context providers that
monitor the surroundings and connectivity of devices. All these provide an enormous amount of data,
which varies depending on the user’s actions and the environment. It is important to test the application
under different environments and with different contextual inputs to verify that it works correctly
[Kirubakaran and Karthikeyani 2013]. Our application uses data provided by GPS sensors and, via
Bluetooth, by heart rate sensors. We have to ensure that the data is provided correctly regardless of the
mobile device and its operating system. Different operating systems support different Bluetooth devices,
so we have to ensure that we test all available and supported devices properly.</p>
          <p>The characteristic that is more involved in the development process, but is still part of the testing process,
is related to the new programming languages that are used for mobile application development. These
programming languages were developed to support mobility, manage resource consumption and
handle new GUIs [Kirubakaran and Karthikeyani 2013]. It is important that code is tested properly
during the development process, according to the features and characteristics of these programming
languages.</p>
          <p>BO, J., XIANG, L. AND XIAOPENG, G., 2007. MobileTest: A Tool Supporting Automatic Black Box Test for Software on Smart Mobile</p>
          <p>Devices. Second International Workshop on Automation of Software Test (AST ’07), pp.8–8.</p>
          <p>BOURQUE, P. AND DUPUIS, R., 2004. Guide to the Software Engineering Body of Knowledge (SWEBOK), 2004.</p>
          <p>FRANKE, D. AND WEISE, C., 2011. Providing a Software Quality Framework for Testing of Mobile Applications. Software Testing,</p>
          <p>Verification and Validation (ICST), 2011 IEEE Fourth International Conference on, pp.431–434.</p>
          <p>GARTNER, 2012. Gartner Says Worldwide Sales of Mobile Phones Declined 3 Percent in Third Quarter of 2012; Smartphone Sales</p>
          <p>Increased 47 Percent.</p>
          <p>HU, C. AND NEAMTIU, I., 2011. Automating GUI testing for Android applications. In Proceedings of the 6th International Workshop on</p>
          <p>Automation of Software Test. New York, NY, USA: ACM, pp. 77–83.</p>
          <p>ITU, 2013. The World in 2013 - ICT Facts and Figures.</p>
          <p>KIRUBAKARAN, B. AND KARTHIKEYANI, V., 2013. Mobile application testing — Challenges and solution approach through automation.</p>
          <p>2013 International Conference on Pattern Recognition, Informatics and Mobile Engineering, pp.79–84.</p>
          <p>PORTIO RESEARCH, 2012. Your Portio Research Mobile Factbook 2012.</p>
          <p>SHE, S., SIVAPALAN, S. AND WARREN, I., 2009. Hermes: A Tool for Testing Mobile Device Applications. Software Engineering</p>
          <p>Conference, 2009. ASWEC ’09. Australian, pp.121–130.</p>
          <p>UTEST, 2012. The Essential Guide to Mobile App Testing.</p>
          <p>WASSERMAN, A.I., 2010. Software engineering issues for mobile application development. Proceedings of the FSE/SDP workshop on</p>
          <p>Future of software engineering research - FoSER ’10, p.397.</p>
          <p>WHITFIELD, K., 2013a. Fast growth of apps user base in booming Asia Pacific market. Portio Research.</p>
          <p>WHITFIELD, K., 2013b. What apps are people using? Portio Research.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Literature Review and Survey: XML Schema Metrics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Wes</given-names>
            <surname>Rishel</surname>
          </string-name>
          . (
          <year>2011</year>
          ).
          <article-title>Does XML Schema Earn its Keep? The Gartner Blog Network</article-title>
          . http://blogs.gartner.com/wes_rishel/2011/12/31/okxml-schema-does-earn-its-keep-in-hl7/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Sušnik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>V slogi je e-račun!</article-title>
          Monitor Pro, http://www.monitorpro.si/41040/praksa/v-slogi-je-e-racun/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          Standard ISO/IEC 9126 Software engineering
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>McDowell</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yue</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Analysis and Metrics of XML Schema</article-title>
          .
          <source>Proceedings of the International Conference on Software Engineering Research and Practice</source>
          , SERP'
          <volume>04</volume>
          , v 2, p
          <fpage>538</fpage>
          -
          <lpage>544</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Burris</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2012</year>
          ),
          <article-title>Hierarchical Nature of Software Quality, Programming in the Large, The Practice of Software Engineering</article-title>
          , http://programminglarge.com/hierarchical
          <article-title>-nature-of-software-quality/.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Narasimhan</surname>
            ,
            <given-names>V.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendradjaya</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Some theoretical considerations for a suite of metrics for the integration of software components</article-title>
          .
          <source>Information Sciences</source>
          , Volume
          <volume>177</volume>
          , Issue
          <issue>3</issue>
          , 1 February
          <year>2007</year>
          , Pages
          <fpage>844</fpage>
          -
          <lpage>864</lpage>
          . http://dx.doi.org/10.1016/j.ins.2006.07.010
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Washizaki</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fukazawa</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>A technique for automatic component extraction from object-oriented programs by refactoring</article-title>
          .
          <source>Science of Computer Programming</source>
          , Volume
          <volume>56</volume>
          , Issues 1-
          <issue>2</issue>
          , April
          <year>2005</year>
          , Pages
          <fpage>99</fpage>
          -
          <lpage>116</lpage>
          . http://dx.doi.org/10.1016/j.scico.2004.11.007
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Thomas J.</given-names>
            <surname>McCabe</surname>
          </string-name>
          .
          <article-title>A Complexity Measure</article-title>
          .
          <year>1976</year>
          .
          <source>IEEE Trans. Software Eng</source>
          .
          <volume>2</volume>
          (
          <issue>4</issue>
          ):
          <fpage>308</fpage>
          -
          <lpage>320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Sanjay</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Murat</given-names>
            <surname>Koyuncu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marco</given-names>
            <surname>Crasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cristian</given-names>
            <surname>Mateos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alejandro</given-names>
            <surname>Zunino</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A Suite of Cognitive Complexity Metrics</article-title>
          .
          <source>In Computational Science and Its Applications ICCSA 2012. Lecture Notes in Computer Science</source>
          , Vol.
          <volume>7336</volume>
          . Springer Berlin Heidelberg,
          <fpage>234</fpage>
          -
          <lpage>237</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Boris</given-names>
            <surname>Motik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter F.</given-names>
            <surname>Patel-Schneider</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bijan</given-names>
            <surname>Parsia</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>OWL 2 Web Ontology Language Structural Specification and Functional-Style Syntax (Second Edition)</article-title>
          .
          Retrieved July
          <year>2013</year>
          from http://www.w3.org/TR/owl2-syntax/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Anthony M.</given-names>
            <surname>Orme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Haining</given-names>
            <surname>Yao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Letha H.</given-names>
            <surname>Etzkorn</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Coupling Metrics for Ontology-Based Systems</article-title>
          .
          <source>IEEE Softw</source>
          .
          <volume>23</volume>
          ,
          <issue>2</issue>
          ,
          <fpage>102</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Terence J.</given-names>
            <surname>Parr</surname>
          </string-name>
          and
          <string-name>
            <given-names>Russell W.</given-names>
            <surname>Quong</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>ANTLR: a predicated-LL(k) parser generator</article-title>
          .
          <source>Softw. Pract. Exper</source>
          .
          <volume>25</volume>
          ,
          <issue>7</issue>
          ,
          <fpage>789</fpage>
          -
          <lpage>810</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Pribela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gordana</given-names>
            <surname>Rakić</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zoran</given-names>
            <surname>Budimac</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>First Experiences in Using Software Metrics in Automated Assessment</article-title>
          .
          <source>In Proc. of the 15th International Multiconference on Information Society (IS)</source>
          ,
          <source>Collaboration, Software and Services in Information Society (CSS)</source>
          ,
          <source>Vol. A</source>
          ,
          <fpage>250</fpage>
          -
          <lpage>253</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Gordana</given-names>
            <surname>Rakić</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zoran</given-names>
            <surname>Budimac</surname>
          </string-name>
          .
          <year>2011</year>
          a.
          <article-title>SMIILE Prototype</article-title>
          .
          <source>In Proc. of International Conference of Numerical Analysis and Applied Mathematics ICNAAM2011, Symposium on Computer Languages, Implementations and Tools (SCLIT)</source>
          ,
          <source>AIP Conf. Proc. 1389</source>
          ,
          <fpage>853</fpage>
          -
          <lpage>856</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Gordana</given-names>
            <surname>Rakić</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zoran</given-names>
            <surname>Budimac</surname>
          </string-name>
          .
          <year>2011</year>
          b.
          <article-title>Introducing Enriched Concrete Syntax Trees</article-title>
          .
          <source>In Proc. of the 14th International Multiconference on Information Society (IS)</source>
          ,
          <source>Collaboration, Software and Services in Information Society (CSS)</source>
          ,
          <source>Vol. A</source>
          ,
          <fpage>231</fpage>
          -
          <lpage>234</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Miloš</given-names>
            <surname>Savić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gordana</given-names>
            <surname>Rakić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zoran</given-names>
            <surname>Budimac</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mirjana</given-names>
            <surname>Ivanović</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Extractor of software networks from enriched concrete syntax trees</article-title>
          .
          <source>In Proc. of International Conference of Numerical Analysis and Applied Mathematics ICNAAM2012, Symposium on Computer Languages, Implementations and Tools (SCLIT)</source>
          ,
          <source>AIP Conf. Proc. 1479</source>
          ,
          <fpage>486</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Samir</given-names>
            <surname>Tartir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. Budak</given-names>
            <surname>Arpinar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Amit P.</given-names>
            <surname>Sheth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Boanerges</given-names>
            <surname>Aleman-Meza</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>OntoQA: Metric-based ontology quality analysis</article-title>
          .
          <source>In Proceedings of IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Hongyu</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yuan-Fang</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hee Beng Kuan</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Measuring design complexity of semantic web ontologies</article-title>
          .
          <source>J. Syst. Softw</source>
          .
          <volume>83</volume>
          ,
          <issue>5</issue>
          ,
          <fpage>803</fpage>
          -
          <lpage>814</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Rok</given-names>
            <surname>Žontar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marjan</given-names>
            <surname>Heričko</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Adoption of object-oriented software metrics for ontology evaluation</article-title>
          .
          <source>In Proceedings of the Fifth Balkan Conference in Informatics (BCI '12)</source>
          .
          <source>ACM</source>
          ,
          <fpage>298</fpage>
          -
          <lpage>301</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>