<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Enhancing Software Quality in Students' Programs</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Second Workshop on Software Quality Analysis, Monitoring, Improvement and Applications SQAMIA 2013</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Harri Keto (Tampere Univ. of Technology, Finland)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vladimir Kurbalija (Univ. of Novi Sad, Serbia)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastas Mishev (Univ. of Ss. Cyril and Methodius, Skopje, FYR Macedonia)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanjay Misra (Atilim Univ., Ankara, Turkey)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vili Podgorelec (Univ. of Maribor, Slovenia)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zoltan Porkolab (Eotvos Lorand Univ., Budapest, Hungary)</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>(Bratislava, Slovakia)</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Informatics Faculty of Sciences, University of Novi Sad</institution>
          ,
          <country country="RS">Serbia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Novi Sad, Faculty of Sciences, Department of Mathematics and Informatics, Trg Dositeja Obradovića 4</institution>
          ,
          <addr-line>21000 Novi Sad</addr-line>
          ,
          <country country="RS">Serbia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>2</volume>
      <issue>13</issue>
      <fpage>15</fpage>
      <lpage>17</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>SQAMIA 2013</title>
      <p>Proceedings ISBN: 978-86-7031-269-2</p>
      <sec id="sec-2-1">
        <title>Preface</title>
        <p>This volume contains papers presented at the Second Workshop on Software Quality Analysis, Monitoring, Improvement, and Applications (SQAMIA 2013). SQAMIA 2013 was held during September 15-17, 2013, at the Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, Novi Sad, Serbia.</p>
        <p>SQAMIA 2013 is a continuation of the successful event held in 2012. The previous workshop, the first one, was organized within the 5th Balkan Conference in Informatics (BCI 2012) in Novi Sad. In 2013 SQAMIA became a standalone event, with the intention of becoming a traditional meeting of scientists and practitioners in the field of software quality.</p>
        <p>The main objective of the SQAMIA workshop series is to provide a forum for the presentation, discussion, and dissemination of scientific findings in the area of software quality, and to promote and improve interaction and cooperation between scientists and young researchers from the region and beyond.</p>
        <p>The SQAMIA 2013 workshop consisted of regular sessions with technical contributions reviewed and selected by an international program committee, as well as invited talks presented by leading scientists in the research areas of the workshop.</p>
        <p>SQAMIA workshops solicited submissions dealing with four aspects of software quality: quality analysis, monitoring, improvement, and applications. Position papers, papers describing work in progress, tool demonstration papers, technical reports, and other papers that would provoke discussion were especially welcome.</p>
        <p>In total, 13 papers were accepted and published in this proceedings volume. All published papers were double reviewed, and some papers received the attention of more than two reviewers. We would like to use this opportunity to thank all PC members and the external reviewers for submitting careful and timely opinions on the papers.</p>
        <p>We also gratefully acknowledge the program co-chairs, Tihana Galinac Grbac (Croatia), Marjan Heričko (Slovenia), Zoltan Horvath (Hungary), Mirjana Ivanović (Serbia), and Hannu Jaakkola (Finland), for helping to greatly improve the quality of the workshop.</p>
        <p>We extend special thanks to the SQAMIA 2013 Organizing Committee from the Department of Mathematics and Informatics, Faculty of Sciences, especially to its chair Gordana Rakić, for her hard work, diligence, and dedication to making this workshop the best it can be.</p>
        <p>Finally, we thank our sponsors, the Provincial Secretariat for Science and Technological Development, the Serbian Ministry of Education, Science and Technological Development, and the Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, for supporting the organization of this event.</p>
        <p>And last, but not least, we thank all the participants of SQAMIA 2013 for having made all the work that went into SQAMIA 2013 worthwhile.</p>
        <p>September 2013
Zoran Budimac</p>
      </sec>
      <sec id="sec-2-2">
        <title>Workshop Organization</title>
        <p>General Chair
Zoran Budimac (Univ. of Novi Sad, Serbia)
Program Chair
Zoran Budimac (Univ. of Novi Sad, Serbia)
Program Co-Chairs
Tihana Galinac Grbac (Univ. of Rijeka, Croatia)
Marjan Heričko (Univ. of Maribor, Slovenia)
Zoltan Horvath (Eotvos Lorand Univ., Budapest, Hungary)
Mirjana Ivanović (Univ. of Novi Sad, Serbia)
Hannu Jaakkola (Tampere Univ. of Technology, Pori, Finland)
Program Committee
Additional Reviewers
Roland Király (Eotvos Lorand Univ., Budapest, Hungary)
Miloš Radovanović (Univ. of Novi Sad, Serbia)
Organizing Committee (Univ. of Novi Sad, Serbia)</p>
        <sec id="sec-2-2-1">
          <title>Gordana Rakić, Chair</title>
        </sec>
        <sec id="sec-2-2-2">
          <title>Zoran Putnik</title>
        </sec>
        <sec id="sec-2-2-3">
          <title>Miloš Savić</title>
        </sec>
        <sec id="sec-2-2-4">
          <title>Organizing Institution</title>
          <p>Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, Serbia</p>
        </sec>
        <sec id="sec-2-2-5">
          <title>Sponsoring Institutions of SQAMIA 2013</title>
          <p>SQAMIA 2013 was partially financially supported by:
Provincial Secretariat for Science and Technological Development, Autonomous Province of Vojvodina, Republic of Serbia;
Ministry of Education, Science and Technological Development, Republic of Serbia;
Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, Serbia.</p>
        </sec>
        <sec id="sec-2-2-11">
          <title>Stability of Software Defect Prediction in Relation to Levels of Data Imbalance</title>
          <p>Stability of Software Defect Prediction in Relation to Levels of Data Imbalance
TIHANA GALINAC GRBAC AND GORAN MAUŠA, University of Rijeka
BOJANA DALBELO-BAŠIĆ, University of Zagreb
Software defect prediction is an important decision support activity in software quality assurance. Its goal is to reduce verification costs by predicting the system modules that are more likely to contain defects, thus enabling more efficient allocation of resources in the verification process. The problem is that there is no widely applicable, well-performing prediction method. The main reason lies in the very nature of software datasets: their imbalance, complexity, and properties dependent on the application domain. In this paper we suggest a research strategy for studying the stability of performance using different machine learning methods over different levels of imbalance in software defect prediction datasets. We also provide a preliminary case study on a dataset from the NASA MDP open repository using multivariate binary logistic regression and forward and backward feature selection. Results indicate that the performance becomes unstable around 80% imbalance.</p>
          <p>Categories and Subject Descriptors: D.2.9 [Software Engineering]: Management—Software quality assurance (SQA)
Additional Key Words and Phrases: Software Defect Prediction, Data Imbalance, Feature Selection, Stability
1. INTRODUCTION
Software defect prediction is recognized as one of the most important ways to achieve software development efficiency. The majority of costs during software development are spent on software defect detection activities, but the ability of these activities to guarantee software reliability is still limited. The analyses performed by [Andersson and Runeson 2007; Fenton and Ohlsson 2000; Galinac Grbac et al. 2013], in the environment of large-scale industrial software with a high focus on reliability, show that faults are distributed within the system according to the Pareto principle. They show that the majority of faults are concentrated in just a small number of system modules, and that these modules do not make up a majority of the system size. This fact implies that software defect prediction would bring real benefits if a well-performing model were applied. The main motivating idea is that if we were able to predict the location of software faults within the system, then we could plan defect detection activities more efficiently. This means that we would be able to concentrate defect detection activities and resources on critical locations within the system rather than on the entire system.</p>
          <p>Numerous studies have already been performed aiming to find the best general software defect prediction model [Hall et al. 2012]. Unfortunately, a well-performing solution is still absent. Data in software defect prediction are very complex and in general do not follow any particular probability distribution that could provide a mathematical model. Data distributions are highly skewed, which is connected to the well-known data imbalance problem, thus making standard machine learning approaches inadequate. Therefore, significant research has recently been devoted to coping with this problem.
Author's address: T. Galinac Grbac, Faculty of Engineering, Vukovarska 58, HR-51000 Rijeka, Croatia; email: tgalinac@riteh.hr; G. Mauša, Faculty of Engineering, Vukovarska 58, HR-51000 Rijeka, Croatia; email: gmausa@riteh.hr; B. Dalbelo-Bašić, Faculty of Electrical Engineering and Computing, Unska 3, HR-10000 Zagreb, Croatia; email: bojana.dalbelo@fer.hr.
Copyright © by the paper's authors. Copying permitted only for private and academic purposes.</p>
          <p>In: Z. Budimac (ed.): Proceedings of the 2nd Workshop on Software Quality Analysis, Monitoring, Improvement, and Applications (SQAMIA), Novi Sad, Serbia, 15.-17.9.2013, published at http://ceur-ws.org
Several solutions have been offered for the data imbalance problem. However, these solutions are not equally effective in all application domains. Moreover, there is still an open question regarding the extent to which imbalanced learning methods help with learning capabilities. This question should be answered with extensive and rigorous experimentation across all application domains, including software defect prediction, aiming to explore the underlying effects that would lead to fundamental understanding [He and Garcia 2009].</p>
          <p>The work presented in this paper is a step in that direction. We present a research strategy that aims to explore the performance stability of software defect prediction models in relation to levels of data imbalance. As an illustrative example we present an experiment undertaken following our strategy. We observed how learning performance, with and without stepwise feature selection, in the case of a logistic regression learner, changes over a range of imbalances in the context of software defect prediction. The findings are only indicative and are to be explored through exhaustive experimentation aligned with the proposed strategy.
1.1 Complexity of software defect prediction data
Software defect prediction (SDP) is concerned with early prediction of the system modules (file, class, module, method, component, or something else) that are likely to have a critical number of faults (above a certain threshold value, THR). Numerous studies have identified that these modules are not common. In fact, they are special cases, and that is why they are harder to find. The dependent variable in learning models is usually a binary variable with two classes, labeled as 'fault-prone' (FP) and 'not-fault-prone' (NFP). The number of FP modules is usually much lower than the number of NFP modules, so the FP modules represent a minority class and the NFP modules a majority class. Datasets with significantly unequal distributions of the minority and majority classes are imbalanced. The independent variables used in SDP studies are numerous. In this paper we address SDP based on static code metrics [McCabe 1976].</p>
          <p>In SDP datasets the level of class imbalance varies across software application domains. We reviewed the software engineering publications dealing with software defect prediction and noticed that the percentage of non-fault-prone modules (%NFP) in the datasets varies widely (from 1% in a medical record system [Andrews and Stringfellow 2001] to more than 94% in a telecom system [Khoshgoftaar and Seliya 2004]) for various software application domains (telecom industry, aeronautics, radar systems, etc.). Since there are SDP initiatives on datasets with a whole range of imbalance percentages, we are motivated to determine the percentage at which data imbalance becomes a problem, i.e., at which learners become unstable.</p>
          <p>As already mentioned above, the random variables measured in software engineering usually do not follow any particular distribution, and the applicability of classical mathematical modeling methods and techniques is limited. Hence, machine learning algorithms have been widely adopted. Among the various learning methods used in defect prediction approaches, this paper explores the capabilities of multivariate binary logistic regression (LR). Our ultimate goal is not to validate different learning algorithms but to explore learning performance stability over different levels of imbalance. LR has shown very good performance in the past and is known to be a simple but robust method. In [Lessmann et al. 2008] it is the 9th best classifier among 22 examined (9/22) and at the same time the 2nd best statistical classifier among the 7 of them (2/7). The stepwise regression classifier was the most accurate classifier (1/4) and was outperformed only in cases with many outliers in [Shepperd and Kadoda 2001]. Very good performance of logistic regression was also observed in [Kaur and Kaur 2012] (3/12 in terms of accuracy and AUC), [Banthia and Gupta 2012] (1/5 both with and without preprocessing of 5 raw NASA datasets), [Giger et al. 2011] (1/8 in terms of median AUC over 15 open source projects), [Jiang et al. 2008] (2/6 in terms of AUC and 3/6 according to the Nemenyi post-hoc test), etc. However, none of these studies analyzed the performance of the logistic regression classifier in relation to data imbalance. The study [Provost 2000] argues that in the majority of published work the performance of the logistic learner would be significantly improved if it were used adequately. We will refer to this issue in more detail in Section 3.</p>
          <p>As in the whole software engineering field, an important problem in software defect prediction is the lack of quality industrial data, and therefore the generalization ability and further propagation of research results are very limited. The problem is usually that these data are considered confidential by industry, or that the data are not available at all for industry with low maturity. To overcome these obstacles, there are initiatives for open repositories of datasets, aligned with the goal of improving the generalization of research results. However, the problem of generalization still remains, because the open repositories usually contain data from a particular type of software (e.g. the NASA MDP repository, open source software repositories, etc.) and/or of questionable quality [Gray et al. 2011].</p>
          <p>In this study we used NASA MDP datasets and carefully addressed all the potential issues, i.e., removed duplicates [Gray et al. 2012]. This selection is motivated by simple comparison of results with the related work, so that our contribution can be easily incorporated into the existing knowledge base of the imbalance problem in the SDP area.
1.2 Experimental approach
Our goal is to explore the stability of evaluation metrics when learning SDP datasets with machine learning techniques across different levels of imbalance. Moreover, we want to evaluate potential sources of bias in the study design by constructing a number of experiments in which we vary one parameter per experiment. The parameters that are subject to change are explained briefly in Section 2.</p>
          <p>To integrate the conclusions obtained from each experiment, a meta-analytic statistical analysis is proposed. Such methods are suggested by a number of authors as a tool for generalizing results and integrating knowledge across many studies [Brooks 1997]. We propose the following steps:
(1) Acquiring data. A sample S of independent random variables X1, ..., Xn measuring different features of a system module, and a binary dependent variable Y measuring fault-proneness (with Y = 1 for FP modules and Y = 0 for NFP modules), is obtained from a repository (e.g. an open repository, open source projects, industrial projects).
(2) Data preprocessing.</p>
          <p>(a) Data cleaning, noise elimination, sampling.
(b) Data multiplication. From the sample S obtained in step (1), a training set of size 2/3 the size of S and a validation set of size 1/3 the size of S are chosen at random k times. In this way k training samples T1, ..., Tk and k validation samples V1, ..., Vk are obtained. These samples are categorized into ℓ categories with respect to the data imbalance, defined as the percentage of NFP modules in Ti and calculated as %NFP(Ti) = NFP(Ti) / (FP(Ti) + NFP(Ti)).
(c) Feature selection. For each training set Ti a feature selection is performed. As a result, some of the random variables Xj are excluded from the model. The inclusion/exclusion frequencies of the Xj for each of the categories introduced in step (2b) are recorded.
(3) Learning.</p>
          <p>(a) Building a learning model. A learning model is built for each training set Ti using the learning techniques under consideration.
(b) Evaluating model performance. Using the validation set Vi, the model built in step (3a) is evaluated using various evaluation metrics. Let M be the random variable measuring the value of one of these metrics.
(4) Statistical analysis.
(a) Variation analysis. The differences between the ℓ samples of a random variable M obtained from samples Ti and Vi belonging to the different categories introduced in step (2b) are analyzed using statistical tests. This step is repeated for each evaluation metric used in step (3b).
(b) Cross-dataset validation. The whole process is repeated from step (1) for m datasets from various application domains and sources. The differences between the ℓ·m samples of a random variable M are analyzed using statistical tests, and the results reveal whether general behavior exists.</p>
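The data multiplication and categorization of step (2b) can be sketched as follows. This is a minimal Python sketch, not the authors' tooling: the helper names, the random seed, and the category bounds (taken from the 51%-96% range of the case study in Section 3) are illustrative assumptions.

```python
import random

def nfp_percentage(labels):
    """%NFP = NFP / (FP + NFP): share of not-fault-prone (label 0) modules."""
    nfp = sum(1 for y in labels if y == 0)
    return 100.0 * nfp / len(labels)

def multiply_data(sample, k, seed=0):
    """Step (2b): draw k random 2/3 training / 1/3 validation splits of
    `sample`, a list of (features, label) pairs."""
    rng = random.Random(seed)
    splits = []
    for _ in range(k):
        shuffled = sample[:]
        rng.shuffle(shuffled)
        cut = 2 * len(shuffled) // 3
        splits.append((shuffled[:cut], shuffled[cut:]))
    return splits

def imbalance_category(pct_nfp, low=51.0, high=96.0, n_cats=5):
    """Assign a %NFP value to one of n_cats equal-width imbalance categories
    over [low, high] (bounds here mirror the preliminary case study)."""
    width = (high - low) / n_cats
    idx = int((pct_nfp - low) // width)
    return min(max(idx, 0), n_cats - 1)
```

Each training sample Ti produced by `multiply_data` would then be placed in the category returned by `imbalance_category(nfp_percentage(...))` before feature selection and learning.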
          <p>To summarize, the conclusions are based on the results of statistical tests comparing the mean values of performance evaluation metrics (see Table I) across different data imbalances of a training sample. The stability of the performance evaluation metrics obtained with different feature selection procedures is evaluated in the same way.
2. DATA IMBALANCE
Data imbalance has received considerable attention within the data mining community during the last decade. It has become a central topic of research, since the problem is present in a majority of data mining application areas [Weiss 2004]. In general, data imbalance degrades learning performance. The problem arises with the learning accuracy of the minority class, in which we are usually more interested: we typically want to predict, in a timely manner, rare events represented by the minority class, whose probability of occurrence is low but whose occurrence leads to significant costs.</p>
          <p>For example, suppose that only a very small number of system modules is faulty, which is the case in systems with very low tolerance for failures (e.g. medical systems, aeronautic systems, telecommunications, etc.). Suppose that we did not identify a faulty module with the help of a software defect prediction algorithm, and because of that developed a defect detection strategy that does not concentrate on that particular module. We thus fail to identify a fault in our defect detection activity, and this fault slips to the customer site. A failure caused by this fault at the customer site would then imply significant costs comprising several items: paying penalties to the customer, losing customer confidence, additional expenses due to corrective maintenance, additional costs in all subsequent system revisions, and additional costs during system evolution. This cost would be considered the misclassification cost of a wrongly classified positive class (note that the positive class in the context of a defect prediction algorithm is a faulty module). On the other hand, the misclassification cost of a wrongly classified negative class would be much lower, because it would involve just more defect detection activities. Obviously, the misclassification costs are unequally weighted, and this is the main obstacle to applying standard machine learning algorithms, because they usually assume the same or similar conditions in the learning and application environments [Provost 2000].</p>
          <p>The study [Provost 2000] surveys data imbalance problems and methods addressing them. Although different methods are recommended for data imbalance problems, it does not give definite answers regarding their applicability in a given application context. Some answers were obtained by other researchers in the field afterwards, and a more recent survey is given in [He and Garcia 2009]. Still, no definite guideline exists that could guide practitioners.
2.1 Dataset considerations
The most popular approach to the class imbalance problem is the use of artificially balanced datasets. Several sampling methods have been proposed for that purpose. In a recent work [Wang and Yao 2013] an experiment with some of the sampling methods is conducted. However, it is concluded in [Kamei et al. 2007] that sampling did not succeed in improving performance with all classifiers. In [Hulse et al. 2007] it is identified that classifier performance is improved by sampling, but individual learners respond differently to sampling.</p>
          <p>Another problem is that in practice the datasets are often very complex, involving a number of issues such as overlapping, lack of representative data, within- and between-class imbalance, and often high dimensionality. The effects of these issues have been widely analyzed separately (sample size in [Raudys and Jain 1991], dimensionality reduction in [Liu and Yu 2005], noise elimination in [Khoshgoftaar et al. 2005]), but not in conjunction with data imbalance. The study performed in [Batista et al. 2004] observes that the problem is related to a combination of absolute imbalance and other complicating factors. Thus, the imbalance problem is just an additional issue in complex datasets such as datasets for software defect prediction.</p>
          <p>Different aspects of feature selection in relation to class imbalance have been studied in [Khoshgoftaar et al. 2010; Gao and Khoshgoftaar 2011; Wang et al. 2012]. All these studies were performed on datasets from the NASA MDP repository. In this work we also used stepwise feature selection as a preprocessing step, because the dataset is high-dimensional and we experiment with logistic regression. Hence, we were able to investigate the stability of the performance with and without the feature selection procedure over different levels of imbalance.</p>
          <p>Besides the methods explained above for obtaining artificially balanced datasets, another approach is to adapt standard machine learning algorithms to operate on imbalanced datasets. In that case the learning approach should be adjusted to the imbalanced situation. A complete review of such approaches and methods can be found in [He and Garcia 2009].
2.2 Evaluation metrics
Another problem of standard machine learning algorithms for imbalanced data is the use of inadequate evaluation metrics, either during the learning procedure or for evaluating the final result. Evaluation metrics are usually derived from the confusion matrix and are given in Table I. They are defined in terms of the following score values. A true positive (TP) score is counted for every correctly (true) classified fault-prone module, and a true negative (TN) score for every correctly (true) classified non-fault-prone module. The other two possibilities are related to false prediction. A false positive (FP) score is counted for every falsely classified, or misclassified, non-fault-prone module (often referred to as a Type II error), and a false negative (FN) score is counted for every falsely classified, or misclassified, fault-prone module (often referred to as a Type I error) [Runeson et al. 2001; Khoshgoftaar and Seliya 2004]. For example, classification accuracy (ACC), the most commonly used evaluation metric in standard machine learning algorithms, is not able to value the minority class appropriately, and leads to poor classification performance on the minority class.</p>
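The score values above determine the metrics of Table I directly. A minimal Python sketch (the helper names are illustrative, not part of the study's tooling), with 1 denoting fault-prone and 0 not-fault-prone:

```python
def confusion_counts(predicted, actual):
    """Count TP, TN, FP, FN with 1 = fault-prone, 0 = not-fault-prone."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    return tp, tn, fp, fn

def table_i_metrics(tp, tn, fp, fn):
    """Table I metrics: accuracy, true positive rate (recall), precision."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # completeness
    pr = tp / (tp + fp) if tp + fp else 0.0   # exactness
    return acc, tpr, pr
```

On a highly imbalanced validation sample, `acc` can stay high even when `tpr` on the minority (fault-prone) class collapses, which is exactly the inadequacy described above.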
          <p>In the case of class imbalance, the precision (PR) and recall (TPR) metrics given in Table I are recommended in a number of studies [He and Garcia 2009], as are the F-measure and G-mean, which are not used here. Precision and recall in combination give a measure of correctly classified fault-prone modules. Precision measures exactness, i.e., how many of the modules classified as fault-prone really are fault-prone, while recall measures completeness, i.e., how many of the truly fault-prone modules are classified correctly.</p>
          <p>Table I. Evaluation metrics derived from the confusion matrix
Accuracy (ACC) = (TP + TN) / (TP + TN + FP + FN)
True positive rate (TPR; sensitivity, recall) = TP / (TP + FN)
Precision (PR; positive predicted value) = TP / (TP + FP)</p>
          <p>The output of a probabilistic machine learning classifier is the probability that a module is fault-prone. Therefore, a cutoff percentage has to be defined in order to perform classification. Since choosing a cutoff value leaves room for bias and possible inconsistencies in a study [Lessmann et al. 2008], there is another measure that deals with that problem, called the area under the curve, AUC [Fawcett 2006]. It takes into account the dependence of the TPR, and of a similar metric for the false positive proportion, on the cutoff value.</p>
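AUC can be computed without sweeping cutoffs explicitly, via its rank (Mann-Whitney) formulation: it equals the probability that a randomly chosen fault-prone module receives a higher score than a randomly chosen non-fault-prone one. A minimal Python sketch (the function name is illustrative; an O(n log n) rank-based version would be used in practice):

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative; ties count as 1/2. Labels: 1 = fault-prone, 0 = not."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 1.0 means every fault-prone module is ranked above every non-fault-prone one; 0.5 corresponds to random ranking, independent of any cutoff choice.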
          <p>None of the aforementioned metrics are cost-sensitive, and for rare events with a very high misclassification cost of Type I errors the key performance indicator is cost. The most favorable evaluation criteria for imbalanced datasets are cost curves, which are also recommended in [Jiang et al. 2008] for the SDP domain.
3. PRELIMINARY CASE STUDY
To illustrate the application of the research strategy proposed in Section 1.2, verify the strategy, provide evidence for the dependence of machine learning performance on the level of data imbalance, and indicate our future goals, we have undertaken a preliminary case study.
(1) Dataset KC1 from the NASA MDP repository was acquired. It consists of n = 29 features, i.e., independent variables Xj. The dependent variable in this dataset is the number of faults in a system module. From this variable we derived a binary dependent variable Y by setting ten different thresholds for fault-proneness, from 1 to 19 with a step of 2 (1, 3, 5, ...). In this way we obtained ten different samples S, and we continued the analysis for all of them.
(2) (a) The well-known issues with the dataset were eliminated using a data cleaning tool [Shepperd et al. 2013].</p>
          <p>
(b) For each of the ten samples obtained in step (1), we made 50 iterations of the random splitting
into training and validation samples. Thus we obtained k = 500 samples Ti and Vi with the
range of data imbalance from 51% to 96%. The samples are categorized into ℓ = 5 categories of
equal length (each spanning 9%).
(c) In the case study we also consider the influence of a feature selection procedure, as already
mentioned in Section 2. We consider the forward and backward stepwise selection procedures [Han and
Kamber 2006]. The decision on inclusion or exclusion of a feature is based on the level of
statistical significance, the p-value. The common significance levels for inclusion and exclusion of
features are used, as in [Mausa et al. 2012; Briand et al. 2000], with p_in = 0.05 and p_out = 0.1,
respectively. The percentage of inclusion of a feature for both procedures and different
categories of data imbalance is given in Table II. We conclude that the feature selection stability of
some features is very tolerant to data imbalance (e.g., Features 5, 22, 28 and 29 are always excluded,
for both the forward and the backward model). Some features are very stable until a certain level of
imbalance (for example, Feature 2 is always included, 100%, until the category with data imbalance of
78%). It is also interesting to observe that some features have similar feature selection
stability in the ideally balanced case and in the highly imbalanced case, whereas for moderate imbalance
they have the opposite feature selection decision.
(3) (a) Learning models are built using multivariate binary logistic regression (LR) [Hastie et al.
2009]. The model incorporates more than one predictor variable and in the fault prediction
case performs according to the equation
π(X1, X2, ..., Xn) = e^(C0 + C1·X1 + ... + Cn·Xn) / (1 + e^(C0 + C1·X1 + ... + Cn·Xn)),   (1)
where Cj are the regression coefficients corresponding to Xj, and π is the probability that a
fault was found in a class during validation. In order to obtain a binary outgoing variable,
a cutoff value splits the results into two categories. Researchers often set the cutoff value to
0.5; robustness to imbalance can be achieved by setting the cutoff to an optimal value
dependent on misclassification costs [Basili et al. 1996]. Our goal is to explore learning performance
over different imbalance levels. However, in this study, due to space limitations, we provide
preliminary results exploring the learning performance stability of standard learning algorithms.
Therefore, we provide results of experiments with the cutoff value set to 0.5 (which is how standard
learning algorithms equally weight misclassification costs). We considered three
different models (with forward feature selection, with backward feature selection, and without feature
selection), and for each of these models the coefficients are calculated separately.
(b) For all validation samples from step (2b) we count the TN, TP, FN and FP scores of the
corresponding model, and calculate the learning performance evaluation metrics ACC, TPR (recall),
AUC and precision using the formulas in Table I.
(4) We performed a statistical analysis of the behavior of the evaluation metrics measured in step (3b) across
the categories introduced in step (2c). Since the samples are not normally distributed, we used
non-parametric tests. The Kruskal-Wallis test showed for all metrics that the values depend
on the category. To explore the differences further, we applied a multiple comparison test. It reveals
that all considered evaluation metrics become unstable at an imbalance level of 80%. According
to the theory explained in Section 2, we expect to obtain significantly different mean values
for all metrics in the category of highest data imbalance (90%–100%).
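Steps (1) and (3a) above can be sketched as follows (a hypothetical illustration with our own function names and example data, not the study's code): derive the binary dependent variable Y from module fault counts at a threshold, measure the resulting data imbalance, and classify with the logistic model of Eq. (1) at the standard 0.5 cutoff.

```python
import math

def binarize(fault_counts, threshold):
    """Y = 1 (fault-prone) if a module has at least `threshold` faults."""
    return [1 if f >= threshold else 0 for f in fault_counts]

def imbalance(labels):
    """Share of the majority class, in percent."""
    ones = sum(labels)
    return 100.0 * max(ones, len(labels) - ones) / len(labels)

def logistic_prob(x, coeffs, intercept):
    """pi(X1..Xn) = e^z / (1 + e^z), with z = C0 + C1*X1 + ... + Cn*Xn."""
    z = intercept + sum(c * xi for c, xi in zip(coeffs, x))
    return math.exp(z) / (1.0 + math.exp(z))

def classify(x, coeffs, intercept, cutoff=0.5):
    """Binary outcome obtained by applying the cutoff to pi."""
    return 1 if logistic_prob(x, coeffs, intercept) >= cutoff else 0

# Ten samples, one per threshold 1, 3, ..., 19 -- imbalance grows with
# the threshold, since fewer modules qualify as fault-prone.
faults = [0, 0, 0, 1, 1, 2, 3, 5, 8, 13]   # invented example counts
samples = {t: binarize(faults, t) for t in range(1, 20, 2)}
```

Raising the threshold shifts more modules into the majority (non-fault-prone) class, which is exactly the mechanism that produces the imbalance categories analyzed above.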
4. DISCUSSION
The data imbalance problem has been widely investigated, and numerous approaches have studied
its effects with the aim of proposing a general solution. However, from experiments in
machine learning theory it becomes obvious that the problem is not only related to the proportion of the minority to the
majority class; other influences are also present in complex datasets. As the datasets in the
software defect prediction (SDP) research area are usually extremely complex, there is a huge unexplored
area of research related to the applicability of these techniques in relation to the level of data imbalance.
That is exactly our main motivation for this work.</p>
          <p>There are many approaches to SDP and to the development of the learning model, depending on the
particular dataset. Since we are interested in the performance stability of machine learners over SDP datasets,
we should rigorously explore the strengths and limitations of these approaches in relation to the level
of data imbalance. Therefore, we present an exploratory research strategy and an example of a case
study performed according to this strategy. Although we designed our experiment to eliminate as many
inconsistencies and threats of applying the strategy as possible, there is still room for improvement.</p>
          <p>In our case study we show how performance stability is significantly degraded at a higher level of
imbalance. This confirms the results obtained by other researchers using different approaches, and that
conclusion supports the reliability of our strategy. Moreover, with the help of our research strategy we
confirmed that feature selection becomes unstable with higher data imbalance. We have also observed
that for some features the feature selection is consistent across levels of imbalance.</p>
          <p>Future work should involve extensive exploration of SDP datasets with the proposed strategy. Our
vision is that in the end we can gain deeper knowledge about imbalanced data in SDP and the applicability
of techniques at different levels of imbalance. Finally, we would like to categorize datasets using the
proposed strategy; the results of this exhaustive research would serve as a guideline for
practitioners when developing software defect prediction models.
D. Gray, D. Bowes, N. Davey, Y. Sun, and B. Christianson. The misuse of the NASA Metrics Data Program data sets for automated
software defect prediction. Processing, pages 96–103, 2011.</p>
          <p>D. Gray, D. Bowes, N. Davey, Y. Sun and B. Christianson. Reflections on the NASA MDP data sets. IET Software, pages
549–558, 2012.</p>
          <p>T. Galinac Grbac, P. Runeson, and D. Huljenic. A second replicated quantitative analysis of fault distributions in complex
software systems. IEEE Transactions on Software Engineering, 39(4):462–476, 2013.</p>
          <p>T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. A systematic literature review on fault prediction performance in
software engineering. Software Engineering, IEEE Transactions on, 38(6):1276–1304, 2012.</p>
          <p>J. Han and M. Kamber. Data mining: concepts and techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
2006.</p>
          <p>T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference and prediction. Springer,
2 edition, 2009.</p>
          <p>H. He and E. A. Garcia. Learning from Imbalanced Data. IEEE Trans. Knowledge and Data Engineering, 21(9):1263-1284,
2009.</p>
          <p>J. Van Hulse, T. Khoshgoftaar, A. Napolitano. Experimental perspectives on learning from imbalanced data. In Proc. 24th
International Conference on Machine Learning (ICML ’07), pages 935–942, 2007.</p>
          <p>Y. Jiang, B. Cukic, and Y. Ma. Techniques for evaluating fault prediction models. Empirical Softw. Engg., 13:561–595, October
2008.</p>
          <p>Y. Kamei, A. Monden, S. Matsumoto, T. Kakimoto, K. Matsumoto. The Effects of Over and Under Sampling on Fault-prone
Module Detection. In Proc. ESEM 2007, First International Symposium on Empirical Software Engineering and Measurement,
pages 196–201. IEEE Computer Society Press, 2007.</p>
          <p>I. Kaur and A. Kaur. Empirical study of software quality estimation. In Proceedings of the Second International Conference
on Computational Science, Engineering and Information Technology, CCSEIT ’12, pages 694–700, New York, NY, USA, 2012.</p>
          <p>ACM.</p>
          <p>T. M. Khoshgoftaar, E. B. Allen, R. Halstead, and G. P. Trio. Detection of fault-prone software modules during a spiral life cycle.</p>
          <p>In Proceedings of the 1996 International Conference on Software Maintenance, ICSM ’96, pages 69–76, Washington, DC, USA,
1996. IEEE Computer Society.</p>
          <p>T. M. Khoshgoftaar and N. Seliya. Comparative assessment of software quality classification techniques: An empirical case
study. Empirical Softw. Engg., 9(3):229–257, Sept. 2004.</p>
          <p>T. M. Khoshgoftaar, N. Seliya, K. Gao. Detecting noisy instances with the rule-based classification model. Intell. Data Anal.,
9(4):347–364, 2005.</p>
          <p>T. M. Khoshgoftaar, K. Gao, N. Seliya. Attribute Selection and Imbalanced Data: Problems in Software Defect Prediction. In</p>
          <p>Proceedings of the 22nd IEEE International Conference on Tools with Artificial Intelligence, pages 137–144, 2010.</p>
          <p>H. Liu, L. Yu. Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Trans. on Knowl. and</p>
          <p>Data Eng., 17(4):491–502, 2005.</p>
          <p>S. Lessmann, B. Baesens, C. Mues, and S. Pietsch. Benchmarking classification models for software defect prediction: a proposed
framework and novel findings. IEEE Transactions on Software Engineering, 34(4):485–496, 2008.</p>
          <p>G. Mausa, T. Galinac Grbac, and B. Basic. Multivariate logistic regression prediction of fault-proneness in software modules. In</p>
          <p>MIPRO, 2012 Proceedings of the 35th International Convention, pages 698–703, 2012.</p>
          <p>T.J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2:308–320, 1976.</p>
          <p>N. Ohlsson, M. Zhao, and M. Helander. Application of multivariate analysis for software fault prediction. Software Quality</p>
          <p>Control, 7:51–66, May 1998.</p>
          <p>F. Provost. Machine Learning from Imbalanced Data Sets 101. In Proc. Learning from Imbalanced Data Sets: Papers from the</p>
          <p>Am. Assoc. for Artificial Intelligence Workshop, Technical Report WS-00-05, 2000.</p>
          <p>S. J. Raudys, A. K. Jain. Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners. IEEE</p>
          <p>Trans. Pattern Anal. Mach. Intell., 13(3):252–264, May 1991.</p>
          <p>P. Runeson, M. C. Ohlsson, and C. Wohlin. A classification scheme for studies on fault-prone components. In Proceedings of the
Third International Conference on Product Focused Software Process Improvement, PROFES ’01, pages 341–355, London, UK,
2001. Springer-Verlag.</p>
          <p>M. Shepperd and G. Kadoda. Comparing software prediction techniques using simulation. IEEE Trans. Softw. Eng., 27(11):1014–
1022, Nov. 2001.</p>
          <p>M. Shepperd, Q. Song, Z. Sun, C. Mair Data Quality: Some Comments on the NASA Software Defect Data Sets. IEEE Trans.</p>
          <p>Softw. Eng., http://doi.ieeecomputersociety.org/10.1109/TSE.2013.11, Nov. 2013.
H. Wang, T. M. Khoshgoftaar, and A. Napolitano. An Empirical Study on the Stability of Feature Selection for Imbalanced
Software Engineering Data. In Proceedings of the 2012 11th International Conference on Machine Learning and Applications
- Volume 01, ICMLA ’12, pages 317–323, Washington, DC, USA, 2012.</p>
          <p>S. Wang and X. Yao. Using Class Imbalance Learning for Software Defect Prediction. IEEE Transactions on Reliability,
62(2):434–443, 2012.</p>
          <p>G.M. Weiss. Mining with rarity: a unifying framework. In SIGKDD Explor. Newsl., 6(1):7–19, 2004.</p>
          <p>T. Zimmermann and N. Nagappan. Predicting defects using network analysis on dependency graphs. In Proceedings of the 30th
international conference on Software engineering, ICSE ’08, pages 531–540, New York, NY, USA, 2008. ACM.
Enhancing Software Quality in Students’ Programs
STELIOS XINOGALOS, University of Macedonia
MIRJANA IVANOVIĆ, University of Novi Sad
This paper focuses on enhancing software quality in students’ programs. To this end, related work is reviewed and proposals for
applying pedagogical software metrics in programming courses are presented. Specifically, we present the main advantages and
disadvantages of using pedagogical software metrics, as well as some proposals for utilizing features already built in contemporary
programming environments for familiarizing students with various software quality issues. Initial experiences with the usage of software
metrics in teaching programming courses and concluding remarks are also presented.</p>
          <p>Categories and Subject Descriptors: D.2.8 [Software Engineering]: Metrics – Complexity measures; K.3.2 [Computers and
Education]: Computer and Information Science Education – Computer science education
General Terms: Education, Measurement
Additional Key Words and Phrases: Pedagogical Software Metrics, Quality of Students’ Software Solutions, Assessments of
Students’ Programs
1. INTRODUCTION
Teaching and learning programming presents teachers and students, respectively, with several challenges.
Students have to comprehend the basic algorithmic/programming constructs and concepts, acquire
problem-solving skills, learn the syntax of at least one programming language, and familiarize themselves with the
programming environment and the whole program development process. Moreover, students nowadays
have to become familiar with imperative and object-oriented programming techniques and utilize them
appropriately. The numerous difficulties encountered by students regarding these issues have been
recorded in the extensive relevant literature. Considering time restrictions, large classes and increasing
dropout rates, adding further important software development aspects, such as software quality, to introductory
programming courses seems to be a difficult mission.</p>
          <p>On the other hand, several empirical studies regarding the development of real-world software systems
have shown that 40% to 60% of the development resources are spent on testing, debugging and
maintenance issues. It is clear both to the software industry and to those teaching programming that
students should be educated to write code of better quality. Several efforts have been made by
researchers and teachers towards achieving this goal. These efforts focus mainly on:
− adjusting widely accepted software quality metrics for use in a pedagogical context,
− devising special tools that carry out static code analysis of students’ programs.</p>
          <p>This paper focuses on studying the related work and making some proposals for dealing with software
quality in students’ programs. Specifically, we propose utilizing features already built in contemporary
programming environments used in our courses, for presenting and familiarizing students with various
software quality issues without extra cost. Of course, using pedagogical software metrics is not an issue
that concerns solely pure programming courses. However, since students form their programming
style in the context of introductory programming courses, it is important to introduce pedagogical
software metrics in such courses and then extend them to other software engineering, information systems and
database courses, or generally to courses that require students to develop software. The rest of the
This work was partially supported by the Serbian Ministry of Education, Science and Technological Development through project
Intelligent Techniques and Their Integration into Wide-Spectrum Decision Support, no. OI174023 and by the Provincial Secretariat
for Science and Technological Development through multilateral project “Technology Enhanced Learning and Software
Engineering”.</p>
          <p>Authors’ addresses: S. Xinogalos, Department of Applied Informatics, School of Information Sciences, 156 Egnatia str., 54006
Thessaloniki, Greece, email: stelios@uom.gr; M. Ivanović, Department of Mathematics and Informatics, Faculty of Sciences, Trg
Dositeja Obradovića 4, 21000 Novi Sad, Serbia, email: mira@dmi.uns.ac.rs
paper is organized as follows. Section 2 discusses related work. Section 3 considers the usage of
pedagogical software metrics. In Section 4 we present some initial experiences with the usage of
software metrics in teaching programming courses. The last section brings concluding remarks.
2. RELATED WORK
When we refer to commercial software quality, a long list of software metrics exists that includes basic
metrics and more elaborate ones, as well as combinations and variations of them. Highly referenced
basic metrics are: (i) the Halstead metric, which is used mainly for estimating the programming effort of a
software system in terms of the operators and operands used; and (ii) the McCabe cyclomatic complexity
measure, which analyzes the number of different execution paths in the system in order to decide how
complex, modular and maintainable it is.</p>
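The idea behind the McCabe measure can be sketched in a few lines (a deliberately simplified, language-agnostic approximation of our own, not a real metrics tool): complexity is estimated as one plus the number of branching keywords found in the source text, whereas real tools count decision nodes in the control-flow graph.

```python
import re

# Crude cyclomatic-complexity proxy: 1 + number of decision points,
# detected as branching keywords in the raw source text. The keyword
# list is an assumption chosen for this illustration.
DECISION_KEYWORDS = re.compile(r"\b(if|elif|for|while|case|catch|and|or)\b")

def cyclomatic_estimate(source):
    """Rough estimate of the number of independent execution paths."""
    return 1 + len(DECISION_KEYWORDS.findall(source))
```

Straight-line code scores 1; every branch or loop adds one, which is why deeply branching student programs score high on this measure.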
          <p>The problem with such metrics is that they have not been developed for use in a pedagogical context.
As [Patton and McGill 2006] state such metrics have the potential to be utilized for analysis of students’
programs, but they have specific shortcomings: several metrics emphasize the length of the code
irrespective of its logic and do not differentiate between various uses of language features, such as a for
versus a while loop, or a switch-case versus a sequence of if statements. When we talk about students’
programs it is clear that as educators we consider the logic of a program more important than its length,
while the appropriate utilization of language features is one of the main goals of introductory
programming courses. In this sense, researchers have proposed software metrics specifically for analyzing
student produced software.</p>
          <p>One such framework has been proposed by [Patton and McGill, 2006] and includes the following
elements: [1] language vocabulary: use of targeted language constructs and elements (e1); [2]
orthogonality/encapsulation: of both tasks (e2) and data (e3); [3] decomposition/modularization: avoiding
duplicates of code (e4) and overly nested constructs (e5); [4] indirection and abstraction (e6); [5]
polymorphism, inheritance and operator overloading (e7).</p>
          <p>Patton and McGill [2006] devised this framework in the context of a study regarding optimal use of
students’ software portfolios and propose attributing its elements to specific pedagogical objectives, and
weighting them according to the desired outcomes of the institution and instructor.</p>
          <p>Another recent study aimed at devising a list of metrics for measuring static quality of student code
and at the same time utilizing it for measuring quality of code between first and second year students. In
this study, seven code characteristics (in italics) that should be present in students’ code are analyzed in
22 properties, as follows [Breuker et al. 2011]: [1] size-balanced: (p1) number of classes in a package; (p2)
number of methods in a class; (p3) number of lines of code in a class; (p4) number of lines of code in a
method, [2] readable: (p5) percentage of blank lines; (p6) percentage of (too) long lines, [3] understandable:
(p7) percentage of comment lines; (p8) usage of multiple languages in identifier naming; (p9) percentage of
short identifiers, [4] structure: (p10) maximum depth of inheritance; (p11) percentage of static variables;
(p12) percentage of static methods; (p13) percentage of non-private attributes in a class, [5] complexity:
(p14) maximum cyclomatic complexity at method level; (p15) maximum level of statement nesting at
method level, [6] code duplicates: (p16) number of code duplicates; (p17) maximum size of code duplicates,
[7] ill-formed statements: (p18) number of assignments in an ‘if’ or ‘while’ condition; (p19) number of
‘switch’ statements without ‘default’; (p20) number of ‘breaks’ outside a ‘switch’ statement; (p21) number
of methods with multiple ‘returns’; (p22) number of hard-coded constants in expressions.</p>
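A few of the readability-oriented properties above are straightforward to compute; the sketch below is our own illustration (not Breuker et al.'s tool) on Java-like source text, covering the percentage of blank lines (p5), of overly long lines (p6, assuming an 80-character limit), and of comment lines (p7, taken here as lines starting with //, /* or *).

```python
def static_properties(source, max_len=80):
    """Compute p5, p6 and p7 (as percentages) for a chunk of source text."""
    lines = source.splitlines()
    n = len(lines) or 1  # avoid division by zero on empty input
    blank = sum(1 for l in lines if not l.strip())
    too_long = sum(1 for l in lines if len(l) > max_len)
    comment = sum(1 for l in lines if l.strip().startswith(("//", "/*", "*")))
    return {
        "p5_blank_pct": 100.0 * blank / n,
        "p6_long_pct": 100.0 * too_long / n,
        "p7_comment_pct": 100.0 * comment / n,
    }
```

Properties such as p14 (cyclomatic complexity) or p16 (code duplicates) require parsing rather than line counting, which is why dedicated tools are used for them.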
          <p>Some researchers have moved a step forward and have developed special tools that perform static
analysis of students’ code. Two characteristic examples are CAP [Schorsch 1995] and Expresso [Hristova
et al. 2003]. CAP (“Code Analyzer for Pascal”) analyzes programs that use a subset of Pascal and provides
user-friendly and informative feedback for syntax, logic and style errors, while Expresso aims to assist
novices writing Java programs in fixing syntax, semantic and logic errors, as well as contributing in
acquiring better programming skills.</p>
          <p>Several other tools have been developed with the aim of automatic grading of students’ programs in
order to provide them with immediate feedback, reducing the workload for instructors and also detecting
plagiarism [Pribela et al. 2008, Truong et al. 2004]. However, in most cases these environments are
targeted to specific languages, such as CAP for Pascal and Expresso for Java. A platform and
programming language independent approach is presented in [Pribela et al. 2012]. Specifically, the usage
of software metrics in automated assessment is studied using two language independent tools: SMILE for
calculating software metrics and Testovid for automated assessment.</p>
          <p>However, none of these solutions has gained widespread acceptance. Our proposal is to utilize features
of contemporary programming environments and tools in order to teach and familiarize students with
important aspects of software quality, as well as help them acquire better programming habits and skills
without extra cost. Usually features of this kind are not utilized appropriately, although they provide the
chance to help students increase the quality of their programs easily.
3. USING PEDAGOGICAL SOFTWARE METRICS
3.1 Advantages and Disadvantages
Pedagogical software metrics can be applied in various ways in courses having a software aspect, with
the ultimate goal of developing better quality software. Specifically, they can be given to students just as
guidelines to follow in order to develop quality code, or as factors that count towards the grading of their
software products. In the latter case it is clear that a considerable amount of time should be devoted to
training students to comprehend and apply the selected software metrics. On the one hand, it is
important for this to take place even in introductory programming courses, since this is the time when
students form their “good” or “bad” programming style/habits, which are not easy to change in the
future. On the other hand, novices have several difficulties to deal with when introduced to programming,
and adding formal rules regarding software metrics might not be a good choice, at least not for all
students. Moreover, adding more material to introductory programming courses is not easy in terms of
both time and volume of material.</p>
          <p>Several researchers and instructors have integrated software metrics in systems used for automatic
checking of software developed by students, either for grading their programs or/and for detecting
plagiarism. The advantages are several. First of all, students can get immediate feedback about their
achievements and be supported in overcoming their difficulties and misconceptions, while grading is fair.
Secondly, instructors save a great deal of time from correcting programs, a process that in the case of
large classes and many practical exercises is extremely time-consuming. Of course, developing such tools
is also not easy and requires a great deal of time and effort.
3.2 The Educational IDE BlueJ
The educational programming environment BlueJ is a widely known environment used in introductory
object oriented programming courses, since it offers several pedagogical features that assist novices.
These features can be appropriately utilized for teaching and familiarizing students with software quality
aspects described in the previous section and helping them acquire better programming habits.</p>
          <p>Editor features. The editor of BlueJ provides some features that can help students firstly appreciate
a good style of programming and secondly inspect their code for the existence of properties proposed in the
framework by [Breuker et al., 2011] or the elements proposed by [Patton and McGill, 2006], or other
similar frameworks. These features are:
− line numbers that can be used for a quick look on the lines of code in methods (p4) and classes (p3)
if the instructor considers it important and provides students with relevant measures for a project
− auto-indenting and bracket matching help students write code that is better structured and more
readable. However, students often do not consider style very important and write
endless lines of code with no indentation and no distinction between blocks of code. In the case of
errors, which are so common in students’ programs, this lack of structure makes the detection of
errors difficult, especially in the case of nested constructs (e5). The instructor can easily convey
this concept by presenting such a program to students (or using their own) and
using the automatic-layout ability provided by BlueJ to show them the corresponding
program with proper indentation, in order to help them realize the difference in practice.
− syntax-highlighting can help students easily inspect their code for ill-formed statements (p18–p21).</p>
          <p>However, the instructor has to make students comprehend that they have to inspect the code they
write and not just compile and run it. Syntax-highlighting, for example, can help students easily
detect a sequence of ‘if’ statements that should be replaced by an ‘if/else if’ or ‘switch’ construct.
− scope highlighting, presented with the use of different background colors for blocks of code,
should be – in the same sense as above – utilized for a quick inspection of nested constructs (e5)
and the level of statement nesting at method level (p15) in order to avoid increased complexity. The
instructor can give students some maximum values to keep in mind and alert them to
reconsider the decomposition/modularization of their solution.
− method comments can be easily added to students’ code. When the cursor is in the context of a
method and the student invokes the ‘method comment’ choice, a template of a method comment is
added to the source code containing the method’s name, Javadoc tags and basic information
regarding parameters, return types and so on. Students must understand that comments (p7)
produce more readable and maintainable code and can also be used for producing a more
comprehensible and valuable documentation view of a class. This interface of a class is important in
project teams and the development of real-world software systems.</p>
          <p>Moreover, if instructors think that a more formal approach should be adopted towards checking coding
styles the BlueJ CheckStyle extension [Giles and Edwards] can be used. This extension is a wrapper for
the CheckStyle (release v5.4) development tool and allows the specification of coding styles in an external
file.</p>
          <p>Incremental development and testing. Students tend to write large portions of code before they
compile and test it, thereby increasing the possibility of error-prone code of lower quality. We consider
it important to develop and test a program incrementally in order to achieve better quality code.
BlueJ offers some unique possibilities for novices towards this direction. Specifically, the ability of
creating objects and calling methods with direct manipulation techniques makes incremental development
and testing an easy process. Students are encouraged to create instances of a class (by right-clicking on it
from the simplified UML class diagram of a project presented in the main window of BlueJ) and call each
method they implement for testing its correctness. Students can even call a method by passing it - with
direct manipulation techniques - references to objects existing and depicted in the object bench. This
makes incremental development and testing of each method much easier and less time-consuming. The
invocation of methods should always be done with the object inspector of each object active, in order to
check how the object’s state is affected and also how it affects method execution. Students should be
encouraged to use the object inspector to check: encapsulation of data (e3); static variables (p11); private
and non-private attributes (p13). It is not unusual for students to write code mechanically, so it is
important for them to learn to inspect afterwards what they have written. This holds for
methods as well. The pop-up menu with the available methods for an object of a class shows explicitly the
public interface of a class and can help novices comprehend public and private access modifiers in practice
and utilize them appropriately. Also, the dialog box that appears when a student creates an object or calls
a method for an object, “asks” the student to enter a value of the appropriate type for each parameter and
helps students realize whether their choices of parameters were correct (i.e. a parameter is missing or it is
not needed). Students can experiment with all the aforementioned concepts by writing the corresponding
statements in the Code pad incorporated in the main window of BlueJ.</p>
          <p>Visualization of concepts. The main window of BlueJ presents students with a simplified UML
class diagram giving an overview of a project’s structure. Specifically, the following information is
presented: name of each class; type of class (concrete, abstract, interface, applet, enum); ‘uses’ and
‘inheritance’ relations. This UML class diagram can be used for getting an overview of a project, whether it is
given to students for study or developed by the students themselves. Students can easily inspect the
overall structure of a project, the number of classes (p1) and the depth of inheritance (p10). Students
should also be encouraged to inspect the UML class diagram in order to: detect classes representing
Redefining Software Quality Metrics to XML Schema
Needs
MAJA PUŠNIK, BOŠTJAN ŠUMAK AND MARJAN HERIČKO, University of Maribor
ZORAN BUDIMAC, University of Novi Sad
The structure and content of XML schemas, important and widely used document definitions, have a significant influence on the
quality of XML data and XML technologies in general; therefore the quality of XML schemas and an accurate assessment of that quality
are a fundamental research challenge in all fields of XML application. A good quality estimation of an XML schema can directly and
indirectly lead to higher efficiency of its usage, simplification of information solutions, efficient maintenance, and higher quality of
data and business processes. This paper addresses challenges in measuring the level of XML schema quality by employing general
software quality metrics; a set of holistically defined and document-oriented metrics is proposed. The proposed XML schema quality
metrics are based on existing software metrics, adapted according to the needs of XML schemas, addressing them mostly from a structural
perspective.</p>
          <p>
            Categories and Subject Descriptors: H.0. [Information Systems]: General; D.2.8 [Software Engineering]: Metrics — Complexity
measures; Product metrics; D.2.9. [Software Engineering]: Management — Software quality assurance (SQA)
General Terms: Software quality assurance
Additional Key Words and Phrases: software metrics, quality metrics, XML Schema
1. INTRODUCTION
The primary role of XML schemas is the definition of XML data and of supporting rules for the use of
XML data, an important part of information technologies. XML schemas and related technologies form
an important part of IT solutions in most Slovenian companies [
            <xref ref-type="bibr" rid="ref3">Sušnik 2008</xref>
            ], in the EU and worldwide [Rishel
2011]. The use of XML has spread from e-business and data exchange to data presentation at
various levels of contemporary information solution architectures: (1) web service interface definitions, (2)
data models, (3) specifications of business cooperation protocols between companies (their many
uses are evident from scientific and technical papers), etc. Due to this widespread use, the
question of XML schema quality is often open, particularly regarding the structure (and content) of
XML schemas, which indirectly influences the quality of the data that an XML schema describes. Measuring
XML schema quality is therefore the basic research challenge of this paper. A solution to the problem (a
composite of metrics) will directly or indirectly lead to greater efficiency in the use of XML schemas,
simplified IT solutions, easier maintenance, and improved quality of data and associated business
processes. Ideally the metrics should cover the structure, the content, and the domain in which an XML
schema is applied; this paper, however, focuses mostly on the structural aspect, trying to take advantage of
existing software metrics.
          </p>
          <p>
There have been several attempts to evaluate and measure XML schemas. Some of them are
            <xref ref-type="bibr" rid="ref3">summarized in
[Zhang 2008</xref>
            ]. Significantly related work was also done in [McDowell, Schmidt, Yue 2004] and
[
            <xref ref-type="bibr" rid="ref7">Narasimhan, Hendradjaya 2007</xref>
            ], where attempts were made to measure XML schemas as well as software in
general. The subject has been addressed in other papers not included in this overview; their
background, however, consists mainly of software metrics, which do not always meet the needs of XML schema
quality (and complexity) measurement.
          </p>
          <p>Based on surveys and interviews conducted within the University of Maribor and nearby companies,
XML schemas are often built irrationally, in a manner that merely satisfies the minimum requirements of
syntactic correctness and content sufficiency. Existing metrics only partially address the problem: they build
on solutions known from software engineering and do not address the problem of an objective
quality evaluation of an XML schema. The dynamic creation and adaptation of XML schemas
poses an additional research challenge that requires new approaches and solutions, both
universal and domain-specific.</p>
          <p>The aim of this paper is the definition of a new theoretical approach for evaluating the quality of XML
schemas, based on the original concept of semantically related analysis of XML schemas and XML
documents, using a new set of metrics. The design correctness of the newly redefined metrics was
confirmed on an extended set of test data of established XML schemas from the fields of e-business
and integration of complex business information systems. For quality measurement purposes we gathered
quality parameters addressing different aspects of XML schema needs and demands.</p>
          <p>
This paper is organized into four chapters. After the presentation of this paper's background and a
description of the included XML quality parameters, chapter two presents all aspects in metric types. Chapter
three presents the metric application and chapter four discusses our present work and future plans.
1.1 XML schema quality parameters
A systematic review of the literature on measuring XML schemas showed that
several metrics have been applied to XML schema evaluation, extracted mainly from software
engineering measurement methods and focusing mostly on the complexity of XML schemas. To include a variety of
parameters addressing complexity and quality, we searched different fields of quality measurement. The
first group of parameters is related to the structural characteristics of XML schemas
            <xref ref-type="bibr" rid="ref1">(we included a
survey in which all currently defined metrics are collected from several authors [Zhang 2008])</xref>
            :
- XML schema size,
- Number of XML nodes and annotations,
- Number of global and local element declarations,
- Number of global or local complex types definitions,
- Number of derived complex types, number of global and local definitions of simple types,
- Number of global or local definitions of models groups (groups),
- Number of global or local definitions of groups of attributes,
- Branch elements, the average cardinality of elements, etc.
          </p>
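<p>As an illustration, structural parameters of the kind listed above (global and local element declarations, type definitions, annotations) can be collected from a parsed schema with a short script. This is a sketch only: the example schema and the counting rules are our own illustration, not tooling from the paper.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

def q(tag):
    # Namespace-qualified tag name for the XML Schema namespace.
    return "{%s}%s" % (XS, tag)

# Build a tiny example schema programmatically (equivalent to an .xsd with
# one global element, one complex type containing two local elements,
# one simple type, and one annotation).
schema = ET.Element(q("schema"))
ET.SubElement(schema, q("annotation"))
ET.SubElement(schema, q("element"), name="person", type="PersonType")
ctype = ET.SubElement(schema, q("complexType"), name="PersonType")
seq = ET.SubElement(ctype, q("sequence"))
ET.SubElement(seq, q("element"), name="name", type="xs:string")
ET.SubElement(seq, q("element"), name="age", type="AgeType")
ET.SubElement(schema, q("simpleType"), name="AgeType")

def structural_counts(root):
    return {
        # global declarations are direct children of xs:schema
        "global_elements": len(root.findall(q("element"))),
        "global_complex_types": len(root.findall(q("complexType"))),
        "global_simple_types": len(root.findall(q("simpleType"))),
        # totals also include local declarations deeper in the tree
        "all_elements": len(root.findall(".//" + q("element"))),
        "annotations": len(root.findall(".//" + q("annotation"))),
    }
```

The same counting approach extends to the remaining parameters (groups, attribute groups, derived types) by querying the corresponding XML Schema elements.</p>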
          <p>Fig. 1 Quality hierarchy in XML schemas (levels: well structured, well connected, flexible and
extendable, expert revised, pleasant use)</p>
          <p>
The typical software metric parameters were extended with parameters from other quality
measurement fields, specifically taken from ISO standards
            <xref ref-type="bibr" rid="ref4 ref5">(ISO/IEC 9126 [McDowell, Schmidt, Yue
2004])</xref>
            , decision model theory [
            <xref ref-type="bibr" rid="ref6">Burris 2012</xref>
            ] and other papers [Zhang 2008]:
- XML schema functionality,
- XML schema simplicity,
- XML schema scalability,
- XML schema comprehensibility,
- XML schema re-use,
          </p>
          <p>- XML schema fullness,
- XML schema integrability,
- XML schema flexibility,
- XML schema implementation,
- XML schema maintenance,
- accuracy,
- validity,
- up-to-dateness,
- minimalism,
- consistency,
- portability,
- security,
- interoperability,
- reliability,
- effectiveness,
- visibility.</p>
          <p>To determine the quality levels of XML schema usage, we borrowed from Maslow’s hierarchy of
needs, which can be applied to software and to all supporting technologies; our interpretation is presented
in Fig. 1. The gathered parameters were organized into six groups, reflecting six identified XML schema
needs, i.e. XML schema quality demands, and meeting the three main XML schema demands: (1) good
structure, (2) consistent contents, (3) compliance with the domain. All parameters contributing to XML
schema quality and all aspects of quality are combined in Fig. 2.
Fig. 2 Quality aspects in XML schemas</p>
          <p>
Fig. 3 Quality-complexity dependence
2. METRIC TYPES
So that individual metrics could be compared, a normalization of parameters was conducted. All
parameters used within the metrics, and their results, were transformed to a scale from 0 to 1,
where 0 represents the worst value of a parameter and 1 the best. The transformation is a linear
mapping, assuming that the growth relationship is linear. The following metrics address all
aspects of XML schema quality.
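A minimal sketch of such a normalization (the clamping of out-of-range values and the degenerate-range guard are our assumptions; the paper only states that the mapping is linear, with 0 as the worst and 1 as the best value):

```python
def normalize(value, worst, best):
    """Linearly map a raw parameter value to [0, 1] (0 = worst, 1 = best).

    Works in both directions: if lower raw values are better, pass
    worst > best and the mapping is inverted automatically.
    """
    if best == worst:
        return 1.0  # degenerate range: nothing to distinguish
    scaled = (value - worst) / (best - worst)
    return max(0.0, min(1.0, scaled))  # clamp out-of-range raw values
```

For example, with `worst=0` and `best=10`, a raw value of 5 maps to 0.5; with `worst=10` and `best=0` (lower is better), a raw value of 3 maps to 0.7.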
2.1 Structural aspect
Other authors have researched measuring the structure of XML schemes for calculating the complexity
and quality by McDowell and others [
            <xref ref-type="bibr" rid="ref6">Burris 2012</xref>
            ]. The authors present a number of metrics, taken
mainly from "quality model" ISO standard and link them into a single formula. Each variable is further
multiplied, however the factors are not justified, values are not normalized, so the formula cannot be
applied, but we have analysed and partly used in our calculation formula of quality.
          </p>
          <p>Within the complexity calculations we can conclude that the higher the value of an individual parameter, the
greater the complexity (the relationship is shown in Fig. 3). According to XML schema needs we redefined the
metrics into the following composite metric (1) with these parameters:
- S1 - ratio between simple and complex data types
- S2 - ratio between annotations and the number of elements
- S3 - average number of restrictions per simple type declaration
- S4 - percentage of derived type declarations among all complex type declarations
- S5 - diversification of elements, or 'fanning', which influences the complexity of XML
schemas, suggesting inconsistencies in XML schemas that unnecessarily increase complexity
2.2 Transparency and documentation of the XML schema
The importance of a well documented and easy-to-read/understand XML schema is addressed through the
following relationship: the number of annotations (NAn) relative to the number of elements (NE) and attributes
(NAt) illustrates the documentation of an XML schema, supposing that more information about the building
blocks increases the quality. The parameters in metric (2) regard transparency and documentation.
Metric1 = (S1 + S2 + S3 + S4 + S5) / 5 (1)
Metric2 = NAn / (NE + NAt) (2)
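Read as the ratio of annotations to declared elements and attributes, metric (2) can be sketched as follows (the clamping to 1 and the zero-denominator guard are our additions, needed to stay on the normalized 0-1 scale):

```python
def metric2(n_annotations, n_elements, n_attributes):
    # Transparency/documentation: annotations per declared building block
    # (element or attribute), clamped to the normalized [0, 1] range.
    building_blocks = n_elements + n_attributes
    if building_blocks == 0:
        return 0.0  # nothing declared, nothing documented
    return min(1.0, n_annotations / building_blocks)
```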
2.3 XML schema optimality
In metric 3 we combined several parameters, indicating the optimal structure of an XML Schema. The
metric evaluates whether the in-lining pattern has been used, the least preferable one in XML schema
building. In doing so, we focus on the following relationships:
- (O1) The ratio between local elements and all elements
- (O2) The ratio between local attributes and all attributes
- (O3) The ratio between global complex elements and all complex elements
- (O4) The ratio between global simple elements and all simple elements.</p>
          <p>The ratios between XML schema building blocks (O1, O2, and O4) should be minimized, meaning
fewer local elements and attributes and more global simple and complex types; the number of
global elements (O3) should be as low as possible, due to the problem of several roots (such flexibility is
not always appreciated). This particular parameter divides domains into two groups (flexible
ones, appropriate for validating multiple different XML schemas, and strict ones, striving for a one-root
policy for validity or other reasons). In metric (3) we assumed that the majority of XML schemas want a certain
level of flexibility, therefore the aspect of security was disregarded.</p>
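<p>Metric (3) averages the four parameters, inverting O3; a direct transcription, assuming O1-O4 are already normalized to [0, 1]:

```python
def metric3(o1, o2, o3, o4):
    # Optimality metric (3): O3 enters inverted, because a low share of
    # global root elements is preferred (the "several roots" problem).
    return (o1 + o2 + (1 - o3) + o4) / 4
```
</p>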
          <p>Metric3 = (O1 + O2 + (1 − O3) + O4) / 4 (3)</p>
          <p>
The metrics described in the following subchapters use a similar set of parameters:
(NE) Number of elements
(NAt) Number of attributes
(NAn) Number of annotations
(LOC) Number of lines of code
(Nre_all) Number of references to elements (simple and complex)
(Nra_all) Number of references to attributes
(Nrg_all) Number of references to groups (of elements and attributes)
(Nri_all) Number of included and imported schemas
(Ng) Number of groups
2.4 XML schema minimalism
In this metric we combine the parameters that indicate the minimal XML schema building blocks,
where minimalism is defined as the level at which one can anticipate that there is no other,
smaller set of building blocks that is still fully descriptive:
2.5 XML schema re-use
The equation was inspired by [
            <xref ref-type="bibr" rid="ref8">Washizaki, Fukazawa 2005</xref>
            ], from which we summarized and defined a
set of metrics for measuring the re-use of software. The metric includes parameters that allow
reuse and are inherently global. We included the following parameters:
          </p>
          <p>
2.6 XML schema integrability
The definition of the equation was taken from the idea of the density of software components [
            <xref ref-type="bibr" rid="ref7">Narasimhan 2007</xref>
            ],
where the authors calculate the density of other segments of the software and the density of
interactions between them (lines of code, operations, classes, modules ...). We adjusted and simplified the
formula into our equation.
3. METRICS APPLICATION
We tested the proposed metrics on a set of 200 XML schemas drawn from different domains and
acknowledging several standards available on the market in each domain. Each XML schema was
evaluated manually and automatically with the proposed metrics, eliminating possible duplicates due to
the crossing of different fields. The results of all metrics were combined and mapped to a scale from 1 to 3,
where a level 1 schema is of high quality and a level 3 XML schema is of low quality (an identical scale was used
for the manual evaluation). Comparing the two types of evaluation, 83% of the data received an equal
evaluation (Fig. 4).
          </p>
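<p>The mapping of combined metric results onto the 1-3 quality scale can be sketched as follows; the concrete thresholds are hypothetical, since the paper does not state them:

```python
def quality_level(metric_values, n_metrics=6):
    # The six equally weighted metrics are summed (equation (7)); a higher
    # sum means better quality. The even thirds used as thresholds between
    # levels 1 (high) .. 3 (low) are an illustrative assumption.
    total = sum(metric_values)
    share = total / n_metrics  # back to the normalized [0, 1] range
    if share >= 2 / 3:
        return 1
    if share >= 1 / 3:
        return 2
    return 3
```
</p>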
          <p>Fig. 4 Manually estimated quality compared with quality measured by the metrics</p>
          <p>All metrics were considered equal; therefore no priority weights were applied to any metric. This
limitation was adopted to simplify our early-stage metric framework; weights were also omitted for
reasons of length, since the paper does not include a clarification of domain/aspect priorities. We treated all
aspects of XML schema quality as equal due to the heterogeneous domains, which were not explored in this paper.
The definition of weights will be part of our future work. For the purposes of this paper, we used the
following equation:
Metric = Metric1 + Metric2 + Metric3 + Metric4 + Metric5 + Metric6
(7)</p>
          <p>A presentation of the metrics' application is shown in Fig. 5. A set of 220 real-life standard or
semi-standard XML schemas was used to apply the defined metrics. The evaluation software produced a resulting
XML document with a summary of all data, any warnings or errors, and the metric results.</p>
          <p>Fig. 5 Metric application example based on an XML schema.
4. DISCUSSION
The focus of this paper was the definition of a full set of parameters for assessing the quality of XML schemas,
trying to include all aspects and needs of XML schema quality. We defined six metrics focusing on
important aspects of XML schema quality, and repositioned XML schema facts into parameters
measuring the importance of each building block. To verify correctness, we evaluated each XML schema
manually, based on a simple overview noting clarity and readability, and compared our results with the
metrics' results. The overlap was 83%.</p>
          <p>Correct (and quick) measurement of XML schema quality supports strategic decision-making and
improvements in data organization, serving as a standard mechanism (internal or global) for evaluating XML
schema quality. Software metrics are a good basis for measuring XML schema quality; however, some
accommodations are necessary according to the needs and demands of XML schemas. As users operate with different
data from multiple domains of XML technology application, the quality measurements vary depending
on the flexibility (or inflexibility) of the structures.</p>
          <p>In future work we will further explore the applicability of the defined metrics, their success and validity on
practical examples, and the need to adapt the metrics to the domain in which an XML
schema is used.
SSQSA Ontology Metrics Front-End
MILOŠ SAVIĆ, ZORAN BUDIMAC, GORDANA RAKIĆ AND MIRJANA IVANOVIĆ, University of Novi Sad
MARJAN HERIČKO, University of Maribor
SSQSA is a set of language-independent tools whose main purpose is to analyze the source code of software systems in order to evaluate
their quality attributes. The aim of this paper is to present how a formal language that is not a programming language can be
integrated into the front-end of the SSQSA framework. Namely, it is explained how the SSQSA front-end is extended to support
OWL2, a domain-specific language for the description of ontological systems. This extension of the SSQSA front-end
represents a step towards the realization of an SSQSA back-end that will be able to compute a hybrid set of metrics reflecting
different aspects of the complexity of ontological descriptions.</p>
          <p>Categories and Subject Descriptors: D.2.8 [Software Engineering]: Metrics – Complexity measures; I.2.4 [Artificial
Intelligence]: Knowledge Representation Formalisms and Methods – Representation languages
General Terms: Languages, Measurement
Additional Key Words and Phrases: OWL2, Ontology metrics, Complexity, SSQSA, eCST representation
1. INTRODUCTION
With the rise of the semantic web, ontologies have become a key technology to provide formal description
of shared and reusable knowledge. Viewed as “explicit specification of conceptualization” [Gruber 1993],
ontologies are used to define concepts and relations present in a domain in order to support reasoning,
integration, and aggregation of data by autonomous software agents. Since real-world ontologies rapidly
increase in size, it has become highly important to measure, evaluate and understand their complexity, in
order to be able to control their maintenance and evolution.</p>
          <p>SSQSA is a set of language-independent tools that statically analyze software systems in order to
evaluate their quality attributes [Budimac et al. 2012]. The whole framework is organized around the
enriched Concrete Syntax Tree (eCST) representation of source code [Rakić and Budimac 2011b]. The
motivation for this work was to explore the possibility of using the eCST representation to compute metrics
which reflect the complexity of ontological descriptions. In order to obtain the eCST representation of
ontology, the SSQSA front-end has to be extended to support a language for the description of ontological
systems. The aim of this paper is to explain how the SSQSA front-end is extended to support OWL2
language in functional-style syntax.</p>
          <p>The rest of the paper is structured as follows. The next section presents related work. Section 3
covers the integration of OWL2 into the SSQSA framework. Section 4 discusses the benefits
of the eCST representation of ontologies. The last section concludes the paper and gives directions for
future work.</p>
          <p>This work was partially supported by the Serbian Ministry of Education, Science and Technological Development through project
Intelligent Techniques and Their Integration into Wide-Spectrum Decision Support, no. OI174023. The authors also would like to
thank Rok Žontar for fruitful discussions on ontology metrics.</p>
          <p>Author's address: M. Savić, Z. Budimac, G. Rakić, M. Ivanović, Department of Mathematics and Informatics, Faculty of Sciences,
University of Novi Sad, Trg Dositeja Obradovića 4, 21000 Novi Sad, Serbia, email: {svc, zjb, goca, mira}@dmi.uns.ac.rs; M. Heričko,
Institute of Informatics, Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova ulica 17, 2000
Maribor, Slovenia, email: marjan.hericko@uni-mb.si.
2. RELATED WORK
2.1 Ontology metrics
In recent years, various metrics for measuring the complexity of ontological descriptions have been proposed.
Inspired by the Chidamber and Kemerer [1994] metrics suite, Yao et al. [2005] proposed three cohesion
metrics defined on a graph that represents subsumption dependencies between ontological
concepts. Orme et al. [2006] introduced three coupling metrics defined on the graph
representation of an ontology. Tartir et al. [2005] introduced the OntoQA metric suite, which contains 12 structural
metrics, also defined on the ontological graph. Zhang et al. [2010] likewise proposed several new graph-based
structural metrics for ontology evaluation. Their metrics suite, among others, contains metrics adopted
from the Chidamber-Kemerer suite (NOC, DIT, CBO). Žontar and Heričko [2012] analyzed software
metrics from the Lorenz-Kidd, Chidamber-Kemerer and Abreu metric suites in order to determine which
of them can be adopted for ontologies. The results of their study show that graph-based software metrics
can be adopted for ontology evaluation.
2.2 SSQSA Framework
The SSQSA framework consists of two parts: the SSQSA front-end, also known as the eCST Generator, and a
set of SSQSA back-ends, individual tools that operate on the eCST representation of source code. The
main characteristic of the eCST representation is that it contains so-called universal nodes,
language-independent markers that denote the meaning of concrete language constructs. The architecture of
SSQSA is presented in Figure I, which also shows how the architecture is planned to be extended with a
new back-end in order to support the analysis and evaluation of ontological systems.</p>
          <p>SSQSA originated from the language-independent software metrics tool SMIILE [Rakić and Budimac
2011a]. SMIILE uses the eCST representation to calculate metrics reflecting the internal complexity of
software entities, such as LOC and cyclomatic complexity [McCabe 1976]. It is also integrated with
Testovid, a semi-automated assessment system for students’ programs, in order to provide metric-based
qualification of programming assignments [Pribela et al. 2012]. SSCA was the first SSQSA back-end
which extended the applicability of the eCST representation [Gerlec et al. 2012]. This tool tracks and
analyzes changes in the hierarchical structure of software entities and stores its results in a repository
that also contains metric values obtained using SMIILE. The most recently realized SSQSA back-end is SNEIPL
[Savić et al. 2012]. This tool extracts dependency networks formed by software entities that can be used to
analyze the design complexity of software systems under the framework of complex network theory.
The obtained networks can also be viewed as fact bases required for reverse engineering activities and used to
calculate metrics related to software design.</p>
          <p>Currently SSQSA supports six general-purpose, imperative programming languages: Java, C#, Delphi,
Modula-2, Pascal and Cobol. This work is therefore the first attempt to extend the SSQSA front-end to
produce the eCST representation of a declarative, domain-specific language.
3. INTEGRATION OF THE OWL2 LANGUAGE INTO THE SSQSA FRONT-END
The eCST Generator uses parsers generated by the ANTLR [Parr and Quong 1995] parser generator to
produce the eCST representation of the source code provided as input. The advantage of using ANTLR
to describe the languages supported by SSQSA is the ANTLR grammar notation itself. This notation enables
the modification of syntax trees through tree rewrite rules attached to grammar productions.
Therefore, in order to integrate OWL2 into the SSQSA front-end, the following steps have to be made:
1. Realization of an ANTLR grammar which describes OWL2 FSS,
2. Identification of OWL2 language constructs that correspond to existing eCST universal nodes,
3. Incorporation of eCST universal nodes into the tree rewrite rules of the grammar in order to obtain the
eCST representation of the parsed text.
3.1 Step 1 – ANTLR grammar for OWL2 FSS
The formal specification of OWL2 FSS in Extended Backus-Naur Form (EBNF) can be found in the official
W3C OWL2 language specification [Motik et al. 2012]. The ANTLR grammar notation closely follows
EBNF, thus the grammar in [Motik et al. 2012] can easily be adapted for ANTLR. At this stage of the
integration, the realized grammar was tested using ten ontologies from the TONES2 repository which were
previously converted into OWL2 FSS using Protégé3. The results are summarized in Table I. It can be
seen that the parser generated from the grammar successfully parsed more than 1.4 million lines of
real-world ontological axioms in less than three minutes.
Table I. Parse results for the test ontologies, among them FMA (Foundational Model of Anatomy), GEO and
Skills: ontology sizes between 20506 and 476111 lines of axioms, parse times between 2 and 59 seconds.</p>
          <p>
3.2 Step 2 – Universal nodes
The OWL2 FSS language contains four types of tokens: keywords, separators, identifiers and constants. For
each of these lexical categories, eCST universal nodes have already been introduced. Ontological
axioms are marked with the STMT universal node, which is used to mark individual statements in imperative
2 http://owl.cs.manchester.ac.uk/repository/
3 http://protege.stanford.edu/
programming languages. Elements of an axiom are also marked with existing universal nodes (TYPE,
ARGUMENT_LIST, and ARGUMENT). The PACKAGE_DECL universal node denotes that entities
declared in an eCST sub-tree rooted at this node are mutually visible. Therefore, PACKAGE_DECL
corresponds to the declaration of ontology. Declarations of ontological entities (concepts, roles and
individuals) are marked with ATTRIBUTE_DECL universal node which is used to denote declarations of
global variables in imperative programming languages. Ontological expressions that can be nested (class
and data range expressions) are marked with the EXPR universal node.</p>
          <p>OWL2 is a declarative, domain-specific language. Before the integration of OWL2, SSQSA supported
several programming languages none of them being declarative or domain-specific. OWL2 axioms
represent explicitly stated relations among ontological entities. Therefore, we introduced three new
universal nodes that denote different categories of explicitly stated relations in general:
1. BINARY_RELATION (BR) marks binary relations
2. SYMMETRIC_RELATION (SR) marks symmetric n-ary relations
3. PARTIALLY_KNOWN_BINARY_RELATION (PKBR) marks binary relations in which one of the
arguments is not known at the moment.</p>
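<p>The categorization of axioms into the three relation nodes, as described in the surrounding text, can be sketched as a lookup table (the axiom names follow the OWL2 functional-style syntax; the fallback to STMT for uncategorized axioms is our assumption):

```python
# Representative OWL2 axiom types mapped to the three new universal nodes.
AXIOM_CATEGORY = {
    # subsumptions and assertions
    "SubClassOf": "BINARY_RELATION",
    "SubObjectPropertyOf": "BINARY_RELATION",
    "ClassAssertion": "BINARY_RELATION",
    # symmetric n-ary relations
    "EquivalentClasses": "SYMMETRIC_RELATION",
    "DisjointClasses": "SYMMETRIC_RELATION",
    "SameIndividual": "SYMMETRIC_RELATION",
    "DifferentIndividuals": "SYMMETRIC_RELATION",
    "EquivalentObjectProperties": "SYMMETRIC_RELATION",
    "DisjointObjectProperties": "SYMMETRIC_RELATION",
    # binary relations with one argument not known at the moment
    "ObjectPropertyDomain": "PARTIALLY_KNOWN_BINARY_RELATION",
    "ObjectPropertyRange": "PARTIALLY_KNOWN_BINARY_RELATION",
}

def universal_node_for(axiom_name):
    # Uncategorized axioms fall back to the generic statement marker.
    return AXIOM_CATEGORY.get(axiom_name, "STMT")
```
</p>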
          <p>All OWL2 relations that denote subsumptions and assertions are marked with BINARY_RELATION. The
SYMMETRIC_RELATION universal node is associated with relations indicating the equivalent and
disjoints classes, same and different individuals, and equivalent and disjoint object properties. The
PARTIALLY_KNOWN_BINARY_RELATION universal node marks object property domain and object
property range relations. The newly introduced universal nodes are currently used only in the eCST
representation of ontological descriptions. However, they can be used to mark explicitly stated binary and
symmetric relations in other descriptive languages as well. Explicitly stated relations among entities in
already supported imperative programming languages are marked with specific, more concrete universal
nodes, such as EXTENDS and IMPLEMENTS. Those universal nodes can be viewed as sub-concepts of
the BINARY_RELATION universal node.
3.3 Step 3 – Tree-rewrite rules
Once the correspondence between the constructs of a concrete language and eCST universal nodes is
identified, it is straightforward to incorporate universal nodes into the tree rewrite rules of the
grammar. For example, it has been identified that ontology declarations correspond to the
PACKAGE_DECL universal node. Therefore, the PACKAGE_DECL universal node is incorporated in
the tree rewrite rule of the production that describes the ontology declaration, as the following excerpt from the
OWL2 FSS grammar shows:
ontology : 'Ontology' '(' (ontologyIRI versionIRI?)? importo* annotation* axiom* ')'
-&gt; ^(PACKAGE_DECL
^(KEYWORD 'Ontology')
^(SEPARATOR '(')
(ontologyIRI versionIRI?)?
importo* annotation* axiom*
^(SEPARATOR ')')
);
Besides the PACKAGE_DECL universal node, two other universal nodes are also incorporated in the rule:
KEYWORD and SEPARATOR to mark keywords and separators in ontology declaration, respectively.</p>
          <p>Figure II shows how a simple ontology named “PL” looks in the eCST representation. The complete
description of the ontology in the functional-style syntax is as follows:</p>
          <p>Ontology (:PL</p>
          <p>SubClassOf(:C :CPP)
)
The SubClassOf axiom states that each program written in the programming language C is at the same
time a valid C++ program.
4. BENEFITS OF OWL2 INTEGRATION INTO SSQSA
Metrics that reflect complexity of a description written in a programming or formal language can be
classified as follows:
1. Metrics of internal complexity reflect lexical and syntactical complexity of the description or some
of its parts. Lexical complexity measures are derived from the lexical elements of a language and
reflect the complexity that is related to the volume of the description. Representative metrics
which belong to this category are LOC family of metrics and Halstead [1977] complexity
measures. Syntactical complexity is related to the compositional (structural) complexity of
concrete language constructs. Cyclomatic complexity is an example of widely used measure of
syntactical complexity.
2. Metrics of design complexity reflect the complexity of dependency structures among identifiers
introduced in the description. Those metrics quantify inheritance, coupling and cohesion
relationships among entities represented by the identifiers. Representative examples are CBO,
NOC, DIT and LCOM metrics from the Chidamber-Kemerer metrics suite.
3. Hybrid metrics combine metrics of internal and design complexity. Examples are WMC and RFC
from the Chidamber-Kemerer metrics suite, and the Henry-Kafura complexity [Henry and Kafura
1981].</p>
          <p>As can be seen from the review of related work on ontology metrics, the complexity of an ontological
description is viewed as some measure of the complexity of an underlying graph representation. In other words,
the ontology metrics introduced so far belong to the category of design complexity metrics. The integration
of OWL2 into the SSQSA front-end provides the eCST representation of an ontology. This representation can
be used to define (or adapt) and compute metrics of internal complexity, which is not possible in the
graph-based representation of an ontology. For example, Halstead complexity metrics adapted for ontologies
can be calculated in the same way as for software systems: by counting eCST universal nodes
representing lexical categories. Similarly, the statement and expression level universal nodes can be used
to derive syntactical complexity measures. Currently, it is possible to use the SMIILE back-end to obtain
LOC and Halstead metrics for ontological descriptions. SMIILE also calculates cyclomatic complexity (CC)
for software systems, but this metric cannot be adopted for ontology evaluation, since there are no OWL2
language elements that correspond to branch and loop statements. However, the predicate counting
procedure used for the computation of the CC metric in SMIILE can be adapted to derive the complexity
of nested OWL2 class and data range expressions (by counting EXPR universal nodes in the eCST).</p>
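<p>The counting of lexical-category universal nodes for Halstead measures can be sketched as follows. The tuple-based eCST encoding is hypothetical and serves only to illustrate the counting procedure; the tree mirrors the "PL" ontology example, treating keywords and separators as operators and identifiers and constants as operands:

```python
from collections import Counter
import math

# A minimal, hypothetical eCST: (universal_node, children) pairs, where
# lexical-category leaves carry the concrete token as their only child.
ECST = ("PACKAGE_DECL", [
    ("KEYWORD", ["Ontology"]),
    ("SEPARATOR", ["("]),
    ("STMT", [
        ("KEYWORD", ["SubClassOf"]),
        ("SEPARATOR", ["("]),
        ("IDENTIFIER", [":C"]),
        ("IDENTIFIER", [":CPP"]),
        ("SEPARATOR", [")"]),
    ]),
    ("SEPARATOR", [")"]),
])

def collect_tokens(node, operators, operands):
    label, children = node
    if label in ("KEYWORD", "SEPARATOR"):
        operators[children[0]] += 1
    elif label in ("IDENTIFIER", "CONSTANT"):
        operands[children[0]] += 1
    else:
        for child in children:
            collect_tokens(child, operators, operands)

def halstead_volume(tree):
    operators, operands = Counter(), Counter()
    collect_tokens(tree, operators, operands)
    vocabulary = len(operators) + len(operands)                # n1 + n2
    length = sum(operators.values()) + sum(operands.values())  # N1 + N2
    return length * math.log2(vocabulary)
```
</p>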
          <p>The PACKAGE_DECL and ATTRIBUTE_DECL universal nodes can be used to recognize the declarations of ontologies and the
declarations of ontological concepts, roles and named individuals in a represented ontological description.
Relations among those entities can be identified by the analysis of eCST sub-trees rooted at BR, SR and
PKBR universal nodes (see Section 3.2). This means that the graph representation of ontology can be
extracted from the eCST representation of the ontology. Therefore, an ontology metrics tool based on
the eCST representation will also be able to compute metrics of design complexity. Finally, metrics of
internal and metrics of design complexity can be combined to obtain hybrid complexity metrics. The
extraction of the graph representation of an ontology is a fundamentally different problem than the extraction
of software networks, due to the structural difference between ontological and software entities. The
hierarchy tree representation of an ontological description can be obtained using the SNEIPL back-end,
but SNEIPL cannot be used to identify horizontal dependencies (dependencies between entities of the
same type) among ontological concepts and individuals. These ontological entities are structurally atomic,
i.e. they are not composed of other ontological entities. By contrast, software entities (classes,
functions, etc.) are not structurally atomic: the definition of a software entity A associates the name of A
with a body that contains the structure of A. Horizontal dependencies between a software entity A and
other entities are contained in the body of A, while horizontal dependencies between ontological entities
are independent of ontological declarations. Since the SNEIPL back-end cannot be used to obtain the
graph representation of ontology, a new SSQSA back-end that computes graph-based ontological metrics
will be developed (see Figure I). This back-end will reuse and adapt modules from SMIILE to compute
metrics of internal complexity, as well as modules from SNEIPL to form the hierarchy tree representation,
which is the first step in the extraction of the ontological graph (identification of ontological entities and
vertical dependencies).</p>
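          <p>The extraction step described above can be sketched in a few lines. The record format below (declared entity names plus relation pairs recovered from BR/SR-style subtrees) is an assumption for illustration only, not the actual eCST structure:
```python
# Hypothetical sketch: entity declarations (from ATTRIBUTE_DECL nodes) become
# graph nodes, and relation subtrees (BR/SR-style) become edges.
# All names and the record format are illustrative assumptions.

declarations = ["Person", "Student", "Course"]   # declared concepts
relations = [("Student", "Person"),              # e.g. a SubClassOf axiom
             ("Student", "Course")]              # e.g. an object property assertion

def build_ontology_graph(decls, rels):
    nodes = set(decls)
    # keep only edges whose endpoints were actually declared
    edges = [(a, b) for a, b in rels if a in nodes and b in nodes]
    return nodes, edges

nodes, edges = build_ontology_graph(declarations, relations)
print(len(nodes), len(edges))  # 3 2
```
Graph-based design metrics (fan-in/fan-out, coupling and the like) could then be computed over the resulting node and edge sets.</p>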
          <p>
The SSCA back-end constructs and compares hierarchy trees of two consecutive versions of a software
system in order to determine changes in vertical dependencies (dependencies among entities at different
levels of abstraction). The hierarchy tree representation of an ontological description can be obtained from
the eCST representation in the same way as for software systems: it is entirely determined by the
hierarchical structure of eCST universal nodes in the concrete eCSTs. This means that this back-end can be
applied to ontologies in order to identify which concepts and named individuals are added or removed in
the next version of an ontology, and to what extent. Finally, with the design and development of new SSQSA
back-ends, it will be investigated whether they can be applied to analyze both software and ontological
systems.
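The core of this version comparison can be sketched as a simple set difference over the entities found in the hierarchy trees of two consecutive versions; the entity names below are illustrative only:
```python
# Minimal sketch of the SSCA-style comparison: given the sets of concepts and
# named individuals extracted from two consecutive ontology versions, report
# what was added and what was removed. Names are invented for illustration.

def diff_versions(old_entities, new_entities):
    added = sorted(new_entities - old_entities)
    removed = sorted(old_entities - new_entities)
    return added, removed

v1 = {"Person", "Student", "Course"}
v2 = {"Person", "Student", "Teacher"}
added, removed = diff_versions(v1, v2)
print(added, removed)  # ['Teacher'] ['Course']
```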
5. CONCLUSION AND FUTURE WORK
In this paper we described how the SSQSA front-end was extended to support the OWL2 language in functional-style
syntax. It was also shown that the eCST representation of ontologies can be used to compute metrics
that reflect both internal and design complexity of ontological descriptions. Therefore, our future work
will include the development of a SSQSA back-end, as shown in Figure I, that uses the eCST
representation of ontology to compute metrics reflecting different aspects of complexity of ontological
descriptions. In our future work we will also investigate whether recently introduced metrics of cognitive
complexity of programs written in object-oriented languages [Misra et al. 2012] and metrics of complexity
of web service descriptions [Basci and Misra 2012] can be adapted and used for ontology evaluation.
Dilek Basci and
            <xref ref-type="bibr" rid="ref10">Sanjay Misra. 2012</xref>
            . Metric suite for maintainability of eXtensible Markup Language web services. IET Softw. 5, 3,
320-341.
          </p>
          <p>
            Zoran Budimac, Gordana Rakić, and
            <xref ref-type="bibr" rid="ref17">Miloš Savić. 2012</xref>
            . SSQSA architecture. In Proceedings of the Fifth Balkan Conference in
          </p>
          <p>Informatics (BCI '12). ACM Conf. Proc. 1479, 287-290.</p>
          <p>Shyam R. Chidamber and Chris F. Kemerer. 1994. A Metrics Suite for Object Oriented Design. IEEE Trans. Softw. Eng. 20, 6,
476-493.
Črt Gerlec, Gordana Rakić, Zoran Budimac, and Marjan Heričko. 2012. A programming language independent framework for
metrics-based software evolution and analysis. Computer Science and Information Systems 9, 3, 1155-1186.
Thomas R. Gruber. 1993. A translation approach to portable ontology specifications. Knowl. Acquis. 5, 2, 199-220.
Maurice H. Halstead. 1977. Elements of software science. Elsevier North-Holland, Amsterdam.</p>
          <p>Sallie M. Henry and Dennis G. Kafura. 1981. Software structure metrics based on information flow. IEEE Computer Society Trans.</p>
          <p>Software Engineering 7, 5, 510-518.
</p>
          <p>Mobile Device and Technology Characteristics’ Impact on
Mobile Application Testing
TINA SCHWEIGHOFER AND MARJAN HERIČKO, University of Maribor
Mobile technologies have a significant impact on processes in ICT, including software development. Within mobile technologies a
new type of software has emerged: mobile applications. Nowadays the concept of mobile applications is widely known, and
their development is increasingly widespread. One of the most important parts of mobile application development is mobile
application testing. Testing has always been a crucial part of the software development cycle, and an appropriate
testing procedure significantly increases the quality of the developed product. With mobile application testing, new
challenges associated with mobile technologies and device characteristics have arisen, for example: connectivity,
convenience, touch screen technology, context awareness and the range of supported devices. It is important that we
adequately address these challenges and perform an appropriate mobile application testing process, resulting in a
high-quality product without critical defects that could cause quality issues or the unwanted waste of human or
financial resources. In this paper, we present a mobile application testing process, indicate its important parts and
especially emphasize the challenges related to the features and properties of mobile devices and technologies.</p>
          <p>General Terms: Mobile applications testing
Additional Key Words and Phrases: testing, mobile applications, mobile technologies, quality
1. INTRODUCTION
Mobile devices and mobile applications play an important role in our everyday lives. Nowadays we are
surrounded by mobile technology and cannot imagine running personal or business errands without them.
This has been confirmed by numerous pieces of research. According to Gartner, worldwide sales of
mobile phones in the third quarter of 2012 reached almost 428 million units. Within this number,
smartphone sales represent almost 40 percent of total mobile phone sales [Gartner 2012]. A similar thing
is happening in the area of mobile subscriptions. At the end of 2012, there were approximately 6.8 billion
mobile subscribers in the world, which corresponds to a global mobile-cellular penetration rate of 96
percent. In Europe the rate is even higher, at 126 percent [ITU
2013].</p>
          <p>Closely related to mobile devices are mobile applications. By the end of 2012, there were approximately
1.1 billion mobile application users. According to forecasts, the number will grow rapidly, by nearly 30
percent per annum, to reach 4.4 billion by the end of 2017 [Whitfield 2013a]. Applications generated $12
billion in revenue in 2012 and a total of 46 billion applications were downloaded [Portio Research 2012].
This number is also expected to grow: in 2013 smartphone and tablet users will download a further 82
billion applications [Whitfield 2013b]. Mobile applications are currently represented in almost every
possible personal or business domain. Although games still constitute the largest category in most of the
major application stores [Whitfield 2013b], mobile applications can be seen in just about every industry.
Some examples include: retail, media, travel, education, healthcare, finance, social, business applications,
collaboration and more [uTest 2012].</p>
          <p>Some of these applications within a specific domain use more or less sensitive user data. Users frequently allow access to personal data in the
context of mobile devices and also enter a lot of personal information. In this context, the issue of users’
trust takes on an important role. It becomes important to provide quality mobile applications that are
reliable and flawless [Hu and Neamtiu 2011]. Applications that are reliable and work flawlessly within
expected functionalities can gain a user’s trust and, more importantly, keep it. Users also often have high
expectations about the quality of mobile applications. Applications that crash and lose users’ personal
data are not acceptable [Bo et al. 2007]. One of the most important mechanisms for providing reliable, flawless
and quality mobile applications is an appropriate testing procedure. Testing during mobile application
development is slightly different from the testing procedures of traditional software, and the process itself is
adapted to the area of mobile applications and mobile technologies.</p>
          <p>Author's address: T. Schweighofer, Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova 17,
2000 Maribor, Slovenia; email: tina.schweighofer@uni-mb.si; M. Heričko, Faculty of Electrical Engineering and Computer Science,
University of Maribor, Smetanova 17, 2000 Maribor, Slovenia; email: marjan.hericko@uni-mb.si.</p>
          <p>In this paper we will present a testing procedure for testing mobile applications. We will identify and
describe specific characteristics for mobile devices, mobile applications and mobile technologies as a
whole, which have a significant impact on the testing procedure. First, in Section 2, we will present the
fundamentals of software testing and reveal some of the major differences between testing traditional and
mobile software. We will also provide an introduction to mobile application testing. In Section 3, we will
present some of the specific characteristics of mobile technologies that have an impact on testing and
challenges in testing mobile applications. This will be grounded in a practical approach to
mobile application testing procedures and the experience we have gained. In the Discussion, we will present the
findings and results of our work.
2. FUNDAMENTALS OF MOBILE APPLICATION TESTING
Mobile application development has specific characteristics that need to be addressed through the entire
product’s life cycle. According to a recent study [Wasserman 2010], there are important software
engineering research issues linked to mobile application development. Some of these issues include:
potential interaction with other applications, handling available sensors, the development of native or
hybrid mobile applications, different families of hardware and software mobile platforms, problems of
security, an adjusted user interface and the problem of power consumption.</p>
          <p>Testing process plays an important role in the life cycle of a software product, whether in mobile or
traditional desktop application. Therefore, it is crucial to address abovementioned issues in related mobile
testing procedures.</p>
          <p>A lot of research has dealt with the fundamentals of software testing; therefore, there are many
available definitions of testing. To summarize one of them: testing is an activity performed for
the purpose of evaluating product quality, and for improving the product by identifying potential defects
and problems. Software testing is composed of the dynamic verification of the program behavior on a
finite set of test cases against the expected program behavior [Bourque and Dupuis 2004].</p>
          <p>Testing is not just an activity that starts after the coding phase is finished and is used to detect
failures. Software testing is a procedure that should be active through the entire product life cycle, from
the development and maintenance process to actual product construction. Also, the planning phase for
testing should occur early in the product requirements process and test plans must be systematically and
continuously developed, as the development of a product proceeds. Currently it is considered that the
right strategy for quality is one of prevention. It is much better to avoid problems than to correct them.
Therefore, testing must be viewed as a procedure for checking if prevention was successful and for
identifying faults in cases where prevention was not effective [Bourque and Dupuis 2004].</p>
          <p>An important aspect that makes mobile testing different is the complexity of testing, a point made by
the authors of the aforementioned study [Wasserman 2010]. One challenge that they mention is the
diversity of available mobile devices, for example Android devices, which is especially relevant when testing
native mobile applications. There are also many other challenges related to mobile application testing. We
will describe these challenges in detail in the subsection below.
2.1 Mobile Application as a Testing Object</p>
          <p>If we want to properly understand the concept of mobile application testing, it is important that we
understand what a mobile application is. We are all familiar with mobile applications, but what does the
definition say? A mobile application is a type of software application designed to run on smart phones,
tablets and other mobile devices. Similarly, a mobile application in the context of mobile computing is an
application that runs on an electronic device that may move
[Kirubakaran and Karthikeyani 2013].</p>
          <p>The testing of mobile applications is an important and also very difficult task, according to various
authors [Bo et al. 2007; She et al. 2009; Kirubakaran and Karthikeyani 2013; Franke and Weise 2011].
They all believe that testing mobile applications is a non-trivial process that takes a lot of time, effort and
other resources. We have had the same experience with projects where we developed mobile applications
for Android, iOS and BlackBerry. The experience is described in detail below in Section 3. As previously
mentioned, as mobile applications become more and more complex and ubiquitous, users have higher and
higher expectations with regard to mobile application quality. Users want an application that does not
fail, lose data or harm the device’s operability, as well as applications that are secure, reliable and easy to
use. If we conduct the testing procedure properly, possible defects embedded in the application can be
detected and removed and this can lead to greater confidence in an application [Bo et al. 2007; She et al.
2009].</p>
          <p>The challenges encountered during mobile application testing were mostly related to the different
characteristics of mobile devices and mobile technologies, which have a direct influence on mobile
applications and the conducted testing procedure. The existing literature describes many of these
characteristics. As noted by [Kirubakaran and Karthikeyani 2013; Franke and Weise 2011],
these characteristics are: connectivity, convenience, user interface, supported devices, touch screens, new
programming languages, resource constraints, context awareness and data persistence. The mentioned
characteristics are presented in Figure 1.</p>
          <p>[Figure 1 depicts the nine characteristics: connectivity, supported devices, resource constraints, convenience, touch screen, context awareness, user interface, programming languages and data persistence.]</p>
          <p>Fig. 1. Characteristics of mobile devices and technologies with their impact on the testing procedure
3. CHALLENGES IN MOBILE APPLICATION TESTING
As previously mentioned, during mobile application testing we came across different challenges. Different
authors have already investigated some of the challenges that have a significant influence on the testing
procedure. We came across the same characteristics that consequently represent challenges in testing
mobile applications. As mentioned, we developed mobile applications for the operating systems Android,
iOS and BlackBerry in the context of a research and development project. The mobile applications are part
of a larger project, which also includes a web application. Within the development process, we also
performed mobile application testing. Application testing is a complex process, but for the needs of this
article we will show a simplified version, which can be seen in Figure 2. The process starts with the
release of a version of the mobile application for a specific platform for testing purposes. The Quality
Assurance team receives a version and starts the testing process based on the recorded test scenarios. If
they find an irregularity, an error or an unreliable function, they report the problem to the web-based bug
tracking system. Bugs are then reviewed and fixed by the development team. We should point out that
within our project we also performed different types of test cycles. The most common was the weekly
testing procedure; there is also testing for the purpose of the application’s release on the corresponding
application market.</p>
          <p>[Fig. 2. Simplified testing process: mobile application version release → QA receives mobile application → testing process based on test scenarios → report problem into bug tracking system → bug fixed by development team.]</p>
          <p>The most important part of the testing process is the execution of test scenarios, where the specific
characteristics of mobile devices are revealed. In fact, they also play an important part in writing test
scenarios, where we have to shape each test scenario so that it considers and verifies a specific
characteristic. When we started to write and later execute specific test scenarios, we reviewed the existing
literature from the area of mobile application testing. Specific characteristics identified in different works
were taken into account within our own testing procedure. The nature of these characteristics, what
existing literature says, and how we dealt with them is discussed below.</p>
          <p>The first property we came across, and one that has an impact on many different types of testing, is
connectivity. Mobile applications have to be designed with the awareness that they will always be online,
because mobile devices are always logged on to a mobile network. Networks can vary in speed, reliability
and security; slow and unreliable wireless networks in particular are a common obstacle for mobile
applications. This property has to be considered in functional testing, where different network and
connectivity scenarios have to be performed, with an emphasis on popular networks. Connectivity
also has an effect on performance, security and reliability testing [Kirubakaran and Karthikeyani 2013;
uTest 2012]. In practice, we address connectivity by testing our applications on different networks.
We also perform test scenarios that cover different internet connections: we use different Wi-Fi networks
and cellular networks from different operators and in different places, such as buildings, city centers or in
nature. For our application connectivity is very important, because the functions of the mobile applications
are supplemented by a web application, so the application uses the synchronization function very often.</p>
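          <p>The combinations described above can be enumerated mechanically. The following sketch is illustrative only (it is not the authors' actual tooling), showing how network types, operators and locations combine into concrete connectivity test scenarios:
```python
# Illustrative sketch: building a connectivity test matrix by combining
# network types, operators and locations into scenario labels.
# All concrete names are invented for illustration.

from itertools import product

networks = ["Wi-Fi", "3G", "EDGE"]
operators = ["operator A", "operator B"]
locations = ["building", "city center", "nature"]

scenarios = ["sync over {} ({}, {})".format(n, o, l)
             for n, o, l in product(networks, operators, locations)]
print(len(scenarios))  # 18
```
Even three values per dimension already yield 18 scenarios, which illustrates why connectivity testing is effort-intensive in practice.</p>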
          <p>Another important property according to other studies is the user interface, which is related to the
characteristic of convenience. This property is important because user interfaces in development need to
follow specific guidelines based on the different platforms for which they are being developed. Different
platforms have their own rules and guidelines about how a user interface should look, so if a
product is being developed for several platforms we have to focus strongly on platform-specific
design. Regardless of the platform, making the best possible use of limited screen space remains a big
challenge, so the design of the user interface takes on greater importance in the
development process. The user interface looks different depending on the mobile device’s screen resolution and
its dimensions. One implication for testing is the range of different devices that needs to be covered by the
testing procedure. It is recommended to test the user interface on as many different mobile devices as
possible, because different devices behave differently with the same application code [Hu and
Neamtiu 2011; Kirubakaran and Karthikeyani 2013; Wasserman 2010]. Within the development of
mobile applications in our project, the developers followed specific rules and good practices for designing
platform specific applications. These guidelines were also reviewed in the testing phase. We also
developed our own Style guide document, which ensured that regardless of the platform, the application
would look similar and reflect the fact that all applications are part of the same product family. With
regard to the testing process, we tested the appearance on different mobile phones, with different
resolution and different physical dimensions. We considered the minimal and optimal screen size, which
was set within the Software requirements specification document.</p>
          <p>Nowadays many different mobile devices are available. What is important is that applications work
flawlessly on as many devices as possible. Supported devices represent one of the most difficult aspects of
the testing process. Devices from different vendors have different software and hardware components. In
particular, there are hundreds of different mobile devices that run the Android operating system, and this
operating system itself exists in many different versions. Different versions of operating systems
are also a great challenge to cover within the testing process [Kirubakaran and Karthikeyani 2013].
Usually it is impossible to test every available device, so we group mobile devices in different categories,
as proposed in [Kirubakaran and Karthikeyani 2013]. The focus of this challenge is on Android mobile
devices. We tested our mobile applications on mobile devices from different vendors, with different
hardware components and different versions of operating systems. We developed three groups: small,
optimized and high quality mobile devices. The first group included mobile devices with a small screen
size and low resources, while the last group included mobile devices with a high screen resolution and a
lot of resources. Test scenarios were carried out on a few representatives of each group. However, iOS
devices were a different story as there is not such a large variety of different mobile devices. The same
testing strategy was used for testing the touch screens of mobile devices and their properties, which also
represent an important challenge in mobile application testing. Touch screens are the main tool for
inputting user data into a mobile application. An important aspect is the system response time to a touch,
which depends on device resource utilization and may easily become slow in some circumstances, such as
a busy processor, a lack of memory or other problems. Thus, it is important to test the touch
screen’s abilities under different circumstances [Kirubakaran and Karthikeyani 2013]. We tested touch
screen capabilities under different circumstances, as proposed: we burdened the processor and available
memory by running multiple applications simultaneously, in order to test the behavior of
different touch screens on different devices.</p>
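          <p>The grouping of test devices described above can be sketched as a simple classifier. The thresholds and group boundaries below are invented for illustration; the actual grouping criteria used in the project were not published in this form:
```python
# Hedged sketch: classifying Android test devices into the three groups
# described in the text ("small", "optimized", "high quality") by screen
# resolution and RAM. Thresholds are illustrative assumptions.

def classify(width_px, height_px, ram_mb):
    pixels = width_px * height_px
    if pixels < 480 * 800 or ram_mb < 512:
        return "small"
    if pixels >= 1080 * 1920 and ram_mb >= 2048:
        return "high quality"
    return "optimized"

print(classify(320, 480, 256))     # small
print(classify(720, 1280, 1024))   # optimized
print(classify(1080, 1920, 3072))  # high quality
```
Test scenarios would then be executed on a few representatives of each group, as the text describes.</p>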
          <p>As many authors agree, mobile devices are becoming more and more powerful, but their resources, like
processor power, RAM, and resolution are still facing restrictions [Kirubakaran and Karthikeyani 2013;
She et al. 2009; Franke and Weise 2011; Portio Research 2012]. This characteristic is closely linked to
some of the previously mentioned characteristics, like supported devices and touch screens. As proposed in
[Kirubakaran and Karthikeyani 2013] mobile device resources have to be continuously monitored, to see
what a specific mobile device is capable of and to verify what actions are taken if a device runs out of
resources. A very similar characteristic is data persistence: because mobile devices that run out of
memory shut down running applications, we have to make sure user data is stored and saved
adequately [Franke and Weise 2011]. We tested these two characteristics within the specified groups of
mobile devices: we tried to overload a specific mobile device and tested the behavior of the mobile
application, checking whether it stored data properly and, of course, where the breaking limit for the
mobile application is.</p>
          <p>A very important characteristic that has a significant impact on testing our mobile application is
context awareness. A lot of mobile applications also rely on sensed data, provided by context providers that
monitor the surroundings and connectivity of devices. All these provide an enormous amount of data,
which varies depending on the user’s actions and the environment. It is important to test the application
under different environments and with different contextual inputs to verify that it works correctly
[Kirubakaran and Karthikeyani 2013]. Our application uses data provided by GPS sensors and, via
Bluetooth, by heart rate sensors. We have to ensure that the data is provided correctly regardless of the
mobile device and its operating system. Different operating systems support different Bluetooth devices,
so we have to ensure that we test all available and supported devices properly.</p>
          <p>The characteristic that is more involved in the development process, but is still part of the testing process,
is related to the new programming languages that are used for mobile application development. These
programming languages were developed to support mobility, manage resource consumption and
handle new GUIs [Kirubakaran and Karthikeyani 2013]. It is important that code is tested properly
during the development process, according to the features and characteristics of these programming
languages.</p>
          <p>BO, J., XIANG, L. AND XIAOPENG, G., 2007. MobileTest: A Tool Supporting Automatic Black Box Test for Software on Smart Mobile</p>
          <p>Devices. Second International Workshop on Automation of Software Test (AST ’07), pp.8–8.</p>
          <p>BOURQUE, P. AND DUPUIS, R., 2004. Guide to the Software Engineering Body of Knowledge (SWEBOK), 2004.</p>
          <p>FRANKE, D. AND WEISE, C., 2011. Providing a Software Quality Framework for Testing of Mobile Applications. Software Testing,</p>
          <p>Verification and Validation (ICST), 2011 IEEE Fourth International Conference on, pp.431–434.</p>
          <p>GARTNER, 2012. Gartner Says Worldwide Sales of Mobile Phones Declined 3 Percent in Third Quarter of 2012; Smartphone Sales</p>
          <p>Increased 47 Percent.</p>
          <p>HU, C. AND NEAMTIU, I., 2011. Automating GUI testing for Android applications. In Proceedings of the 6th International Workshop on</p>
          <p>Automation of Software Test. New York, NY, USA: ACM, pp. 77–83.</p>
          <p>ITU, 2013. The World in 2013 - ICT Facts and Figures.</p>
          <p>KIRUBAKARAN, B. AND KARTHIKEYANI, V., 2013. Mobile application testing — Challenges and solution approach through automation.</p>
          <p>2013 International Conference on Pattern Recognition, Informatics and Mobile Engineering, pp.79–84.</p>
          <p>PORTIO RESEARCH, 2012. Your Portio Research Mobile Factbook 2012.</p>
          <p>SHE, S., SIVAPALAN, S. AND WARREN, I., 2009. Hermes: A Tool for Testing Mobile Device Applications. Software Engineering</p>
          <p>Conference, 2009. ASWEC ’09. Australian, pp.121–130.</p>
          <p>UTEST, 2012. The Essential Guide to Mobile App Testing.</p>
          <p>WASSERMAN, A.I., 2010. Software engineering issues for mobile application development. Proceedings of the FSE/SDP workshop on</p>
          <p>Future of software engineering research - FoSER ’10, p.397.</p>
          <p>WHITFIELD, K., 2013a. Fast growth of apps user base in booming Asia Pacific market. Portio Research.</p>
          <p>WHITFIELD, K., 2013b. What apps are people using? Portio Research.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Literature Review and Survey: XML Schema Metrics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Wes</given-names>
            <surname>Rishel</surname>
          </string-name>
          . (
          <year>2011</year>
          ).
          <article-title>Does XML Schema Earn its Keep? The Gartner Blog Network</article-title>
          . http://blogs.gartner.com/wes_rishel/2011/12/31/okxml-schema-does-earn-its-keep-in-hl7/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Sušnik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>V slogi je e-račun!</article-title>
          Monitor Pro, http://www.monitorpro.si/41040/praksa/v-slogi-je-e-racun/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          Standard ISO/IEC 9126 Software engineering
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>McDowell</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yue</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Analysis and Metrics of XML Schema</article-title>
          .
          <source>Proceedings of the International Conference on Software Engineering Research and Practice</source>
          , SERP'
          <volume>04</volume>
          , v 2, p
          <fpage>538</fpage>
          -
          <lpage>544</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Burris</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2012</year>
          ),
          <article-title>Hierarchical Nature of Software Quality, Programming in the Large, The Practice of Software Engineering</article-title>
          , http://programminglarge.com/hierarchical
          <article-title>-nature-of-software-quality/.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Narasimhan</surname>
            ,
            <given-names>V.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendradjaya</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Some theoretical considerations for a suite of metrics for the integration of software components</article-title>
          .
          <source>Information Sciences</source>
          , Volume
          <volume>177</volume>
          , Issue
          <issue>3</issue>
          , 1 February
          <year>2007</year>
          , Pages
          <fpage>844</fpage>
          -
          <lpage>864</lpage>
          . http://dx.doi.org/10.1016/j.ins.2006.07.010
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Washizaki</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fukazawa</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>A technique for automatic component extraction from object-oriented programs by refactoring</article-title>
          .
          <source>Science of Computer Programming</source>
          , Volume
          <volume>56</volume>
          , Issues 1-
          <issue>2</issue>
          , April
          <year>2005</year>
          , Pages
          <fpage>99</fpage>
          -
          <lpage>116</lpage>
          . http://dx.doi.org/10.1016/j.scico.2004.11.007
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Thomas J.</given-names>
            <surname>McCabe</surname>
          </string-name>
          .
          <article-title>A Complexity Measure</article-title>
          .
          <year>1976</year>
          .
          <source>IEEE Trans. Software Eng</source>
          .
          <volume>2</volume>
          (
          <issue>4</issue>
          ):
          <fpage>308</fpage>
          -
          <lpage>320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Sanjay</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Murat</given-names>
            <surname>Koyuncu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marco</given-names>
            <surname>Crasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cristian</given-names>
            <surname>Mateos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alejandro</given-names>
            <surname>Zunino</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A Suite of Cognitive Complexity Metrics</article-title>
          .
          <source>In Computational Science and Its Applications ICCSA 2012. Lecture Notes in Computer Science</source>
          , Vol.
          <volume>7336</volume>
          . Springer Berlin Heidelberg,
          <fpage>234</fpage>
          -
          <lpage>237</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Boris</given-names>
            <surname>Motik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter F.</given-names>
            <surname>Patel-Schneider</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bijan</given-names>
            <surname>Parsia</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>OWL 2 Web Ontology Language Structural Specification and Functional-Style Syntax (Second Edition)</article-title>
          .
          Retrieved July
          <year>2013</year>
          from http://www.w3.org/TR/owl2-syntax/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Anthony M.</given-names>
            <surname>Orme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Haining</given-names>
            <surname>Yao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Letha H.</given-names>
            <surname>Etzkorn</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Coupling Metrics for Ontology-Based Systems</article-title>
          .
          <source>IEEE Softw</source>
          .
          <volume>23</volume>
          ,
          <issue>2</issue>
          ,
          <fpage>102</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Terence J.</given-names>
            <surname>Parr</surname>
          </string-name>
          and
          <string-name>
            <given-names>Russell W.</given-names>
            <surname>Quong</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>ANTLR: a predicated-LL(k) parser generator</article-title>
          .
          <source>Softw. Pract. Exper</source>
          .
          <volume>25</volume>
          ,
          <issue>7</issue>
          ,
          <fpage>789</fpage>
          -
          <lpage>810</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Pribela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gordana</given-names>
            <surname>Rakić</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zoran</given-names>
            <surname>Budimac</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>First Experiences in Using Software Metrics in Automated Assessment</article-title>
          .
          <source>In Proc. of the 15th International Multiconference on Information Society (IS)</source>
          ,
          <source>Collaboration, Software and Services in Information Society (CSS)</source>
          ,
          <source>Vol. A</source>
          ,
          <fpage>250</fpage>
          -
          <lpage>253</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Gordana</given-names>
            <surname>Rakić</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zoran</given-names>
            <surname>Budimac</surname>
          </string-name>
          .
          <year>2011</year>
          a.
          <article-title>SMIILE Prototype</article-title>
          .
          <source>In Proc. of International Conference of Numerical Analysis and Applied Mathematics ICNAAM2011, Symposium on Computer Languages, Implementations and Tools (SCLIT)</source>
          ,
          <source>AIP Conf. Proc. 1389</source>
          ,
          <fpage>853</fpage>
          -
          <lpage>856</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Gordana</given-names>
            <surname>Rakić</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zoran</given-names>
            <surname>Budimac</surname>
          </string-name>
          .
          <year>2011</year>
          b.
          <article-title>Introducing Enriched Concrete Syntax Trees</article-title>
          .
          <source>In Proc. of the 14th International Multiconference on Information Society (IS)</source>
          ,
          <source>Collaboration, Software and Services in Information Society (CSS)</source>
          ,
          <source>Vol. A</source>
          ,
          <fpage>231</fpage>
          -
          <lpage>234</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Miloš</given-names>
            <surname>Savić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gordana</given-names>
            <surname>Rakić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zoran</given-names>
            <surname>Budimac</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mirjana</given-names>
            <surname>Ivanović</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Extractor of software networks from enriched concrete syntax trees</article-title>
          .
          <source>In Proc. of International Conference of Numerical Analysis and Applied Mathematics ICNAAM2012, Symposium on Computer Languages, Implementations and Tools (SCLIT)</source>
          ,
          <source>AIP Conf. Proc. 1479</source>
          ,
          <fpage>486</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Samir</given-names>
            <surname>Tartir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. Budak</given-names>
            <surname>Arpinar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Amit P.</given-names>
            <surname>Sheth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Boanerges</given-names>
            <surname>Aleman-Meza</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>OntoQA: Metric-based ontology quality analysis</article-title>
          .
          <source>In Proceedings of IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Hongyu</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yuan-Fang</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hee Beng Kuan</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Measuring design complexity of semantic web ontologies</article-title>
          .
          <source>J. Syst. Softw</source>
          .
          <volume>83</volume>
          ,
          <issue>5</issue>
          ,
          <fpage>803</fpage>
          -
          <lpage>814</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Rok</given-names>
            <surname>Žontar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marjan</given-names>
            <surname>Heričko</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Adoption of object-oriented software metrics for ontology evaluation</article-title>
          .
          <source>In Proceedings of the Fifth Balkan Conference in Informatics (BCI '12)</source>
          .
          <source>ACM</source>
          ,
          <fpage>298</fpage>
          -
          <lpage>301</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>