Decision Support Architecture for Primary Studies Evaluation

Vilmar Nepomuceno
Informatics Center (CIn)
Federal University of Pernambuco
Recife - PE, Brazil
vsn@cin.ufpe.br

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
Background. A systematic literature review is a process in which all relevant available research about a research question is identified, evaluated, and interpreted through individual studies. The workload required for this process may bias the evaluation of the studies, affecting the result. Aim. To create a decision support architecture that assists participants of a systematic review in the selection of individual studies and in the quality assessment of these studies, possibly improving the execution time and reducing the evaluation bias. Method. We will improve the primary studies selection and quality assessment processes by using text mining techniques and ontologies to construct a decision support architecture. We will also conduct experiments to evaluate the proposed architecture. Contribution. Improved primary studies selection and quality assessment processes, with a reduced workload and a lower evaluation bias in systematic literature reviews.

Keywords
Systematic Literature Review, Text Mining, Ontology.

1. INTRODUCTION
The systematic literature review (SLR) is a process in which all relevant available research about a research question, area, or phenomenon of interest is identified, evaluated, and interpreted through individual studies, which are called primary studies during the SLR process. A guide for conducting SLRs in software engineering was proposed by Kitchenham and Charters [1]. It summarizes the systematic review process in three main phases: planning the review, conducting the review, and reporting the review results.

The SLR planning phase generates a protocol that defines how to conduct the review. Once started, the process of conducting the review needs to define a search strategy, which is responsible for finding the available primary studies. Once the candidate studies have been obtained, they must be selected using the criteria defined in the SLR protocol. The criteria for study inclusion and exclusion must take into account the research question already defined.

The selection of primary studies depends on how the person leading the SLR interprets the criteria. Hence, the large number of papers retrieved during the search process and the poor quality of their abstracts [2] make the selection process a hard and sometimes inaccurate task. After the selection, the quality of the selected papers is evaluated to increase the reliability and importance of the SLR results. Several guidelines exist for this task, but they are usually not properly followed and their use, in general, is not justified by the authors [3]. These two processes of primary studies evaluation, study selection and quality assessment, are hard and time-consuming tasks [4], and much of this time is due to the primary study reading procedure.

Some studies address the automatic selection of primary studies using text mining techniques, in which the primary studies are classified by text similarity [5] [6]. However, in these studies the conductor does not take part in the process, which can generate a set of papers that is significantly different from the one that the manual process would generate. Besides, according to Kitchenham et al. [7], the process of quality assessment for primary studies is essential for SLRs. Our research seeks (i) to identify a way to semi-automatically support the quality assessment of primary studies, using text mining techniques and ontologies to describe prior knowledge about the SLR, and (ii) to identify, semi-automatically, the inclusion/exclusion criteria inside the text of the studies returned by the search, also through text mining techniques, supporting the selection of the primary studies.

The semi-automatic quality evaluation and the semi-automatic search for inclusion/exclusion criteria should support the evaluation of primary studies in the SLR process. With this, we aim to reduce the execution time of the SLR, which is largely due to the primary study reading procedure, and to decrease the evaluation bias that may arise from the subjectivity of the evaluation process, increasing the assurance that the outcome of the SLR is not being compromised [8]. In addition, the study search space can be enlarged; today it is limited because of the effort spent during the selection process and the subsequent quality evaluation of the studies.

2. RELATED WORKS
2.1 Quality Assessment
There are several guidelines available for quality assessment of primary studies, such as [9], which puts forward eleven evaluation criteria based on CASP [10]. Kitchenham et al. [11] used a checklist for quality assessment in order to specify an appropriate process for evaluating quality. The study concluded that at least two evaluators are required to improve reliability and that the quality assessment should be represented by the sum of the criteria used. However, Dieste et al. [12] found indications that quality assessment instruments for SLRs should not contain an excessive number of items, and that this assessment should be careful about the limits of the process with respect to aspects of internal validity [11].

There is still no standard process for primary studies quality assessment in software engineering; several authors have suggested different ways to estimate the validity of the studies. Probably due to this lack of standardization, our research found no tools that automate the process, or part of it.

2.2 Primary Studies Selection
The selection of primary studies is the most time-consuming task in an SLR, and it can be affected by titles and abstracts that do not reflect the content of the work well [4]. Additionally, time constraints may lead the research conductors to reduce the search space. Hence, automating the process, or part of it, can help overcome these barriers.

Automatic classification of primary studies, indicating their inclusion or exclusion, can be found in some works [5] [6]. In these works, the study selection process is done in an automated way, without the intervention of the researcher. This approach can dramatically reduce the effort spent on the task. However, methods that perform this kind of selection may include studies that do not help the research, because of their low quality or the low accuracy of the algorithm, and they may exclude studies with low textual similarity but good quality, which could be used by the researcher in some way.

Therefore, this research proposes a semi-automated approach, leaving to the researcher the final decision on whether to include these studies. It is likely that the technique proposed in this work will take longer to perform the task than the automatic classifiers. However, it is possible that the final result of the selection process using our proposal will be more satisfactory, and that the process will be significantly faster than the manual approach.

3. PROPOSAL
According to the aim of this study, our guiding research question (RQ) is "How can we improve the process of evaluating primary studies by automating parts of the process?". This question can be decomposed into:

• RQ1. How can we improve the selection process by automating parts of the process?
• RQ2. How can we improve the quality assessment process by automating parts of the process?

An initial version of the proposed Decision Support Architecture (DS Architecture) is presented in Figure 1.

Figure 1. Proposed DS Architecture.

The knowledge about the SLR will be represented by an ontology (1) based on the protocol designed for the execution of the SLR and on possible refinements by the research conductor. This approach is proposed by Biolchini et al. [13], who state that it is possible to improve the results obtained during the SLR by using an ontology to standardize the terminology for the concepts involved. How the ontology will be created is still undefined, but there are freely available tools such as Protégé [14] and Jena [15], as well as the GATE (General Architecture for Text Engineering) framework [16], which was already used in [17] for this purpose.

The inclusion/exclusion and quality criteria should be used as input to a text mining algorithm, which should be available as a tool component (DSTool) and will be able to identify whether the selection criteria have been met and which quality criteria can be identified inside the primary study text. The points of interest in the text, on which the tool bases its response, will be shown to the research conductor, who will decide whether the criteria were actually met. The inclusion/exclusion criteria will be provided by the conductor when the protocol is created. The quality criteria will come from one of the guidelines available in the literature, chosen after a systematic literature review that should indicate the best guideline to be implemented by the text mining algorithm. The main reason for providing a guideline in advance is to improve the robustness and accuracy of the algorithm; this does not prevent the use of other criteria that are not present in the selected guideline. Another input to the text mining algorithm is the ontology built upon the protocol (3). Many text mining techniques use ontologies as a knowledge base; one of them is the ontology-based question answering system [18], which is the starting point for the tool development. Ways to present the results from the tool (4) will still be evaluated.

3.1 Decision Support Algorithm
Based on the system architecture proposed by Bo and Yunqing [18], a question answering architecture is proposed for the decision support algorithm (Figure 2).

Figure 2. Proposed Algorithm Architecture.

The questions that serve as input to the algorithm will be provided by the protocol. The inclusion/exclusion criteria and the quality criteria will be provided one by one to the segmentation process (1). At this point, the question will be broken into terms and the keywords will be extracted (2). Take, for example, the following quality criterion: "Is there a clear statement (definition) of the aims (goals, purposes, problems, motivations, objectives, questions) of the research?" [3], from which we can draw the following set of terms: T = {statement, definition, goals, purposes, problems, motivations, objectives, questions, research}. The keywords found in the criteria will be expanded with the help of the ontology (3). Thus, we can create a larger set of terms that should improve the accuracy of the algorithm. These three steps can be called the search query creation process.

After creating the search query, we look for the answers in the primary studies (4). At this point we use text mining algorithms to find possible answers to the criteria and return the points in the text where the answers can be found, as well as the text of the response (5).

4. EVALUATION PROPOSAL
To evaluate the effectiveness of the Decision Support Architecture (DSArch), a controlled experiment [19] will be performed. The study aims to answer the following research questions:

• RQ1. Does DSArch decrease the selection time of primary studies?
• RQ2. Does DSArch increase consensus between the individuals of the same pair when including/excluding studies?
• RQ3. Does DSArch decrease the quality assessment time of the selected primary studies?

4.1 Hypotheses, Variables, and Parameters
The null hypotheses are presented below:

• H0,RQ1. There is no difference in the execution time of the selection process with or without the use of DSArch.
• H0,RQ2. There is no difference in the consensus among peers with or without the use of DSArch.
• H0,RQ3. There is no difference in the quality assessment time of the selected primary studies with or without the use of DSArch.

To examine the hypotheses, the following dependent variables will be used:

• Selection Time. The time to complete the process of selecting primary studies.
• Consensus. Measured by an inter-rater agreement coefficient. Cohen's Kappa will be used, because decisions are taken by pairs.
• Quality Time. The quality assessment time of the selected primary studies.

The experiment factor is the process for evaluating the primary studies, which comprises selecting the studies and assessing the quality of the selected primary studies. The factor alternatives (treatments) are (i) executing the inclusion/exclusion process and the quality assessment using DSArch and (ii) applying the protocol criteria manually, without tool support.

4.2 Material, Tool, and Training
To perform the experiment, a subset of primary studies drawn from an SLR will be provided to participants. The protocol created for the SLR will be provided, and both the search process and the visualization of the primary studies will be conducted with REviewER (http://sites.google.com/site/eseportal/tools/reviewer). This tool belongs to our research group; it supports automated search in several databases (ACM library, Engineering Village, IEEE Xplore, Science Direct, Scopus, and Springer Link) and supports the primary studies selection process. A version of REviewER containing the DS Architecture will be developed for use in the evaluation. However, only the DS Architecture should be assessed.

A training session will be conducted with the participants to present the REviewER tool and to explain how the DSArch should be analyzed. For this training, some primary studies will be chosen from a subset of the primary studies obtained from the SLR search process, and participants should evaluate them using the factor alternatives.

4.3 Task and Data
To measure the dependent variables, two tasks will be carried out. They require no prior knowledge of SLRs from the subjects, which should make it easier to recruit participants. The tasks to be performed in the experiment are the selection of primary studies from a subset of the primary studies obtained from the SLR search process and the quality assessment of the selected studies. No other SLR procedures will be performed.

At the end of the execution, the participants must provide a list of accepted primary studies, the time taken to complete the selection process, and the time taken to complete the quality assessment process. A questionnaire will be applied after the experiment to evaluate the experiment itself and what the participants thought about the proposed architecture.

4.4 Execution
The experiment will take place in a lab with all participants present at the same time. Participants will be divided into pairs, following the approach proposed by Kitchenham et al. [7]; the pairs will be formed at random.

Table 2 shows the proposed design of the experiment:

Table 2. Experimental Design

Pairs/EU   EU1   EU2
P1         A     B
P2         B     A

P1 and P2 are two pairs of participants. EU1 and EU2 are the experimental units; each is one subset of the primary studies obtained from the SLR proposed in the experiment. The subsets will be chosen at random, as will A and B, which are the applied treatments. This design is proposed to facilitate internal replication of the experiment.

To collect the data, the participants should submit the list of selected primary studies, the time taken to complete the selection process, and the time taken to complete the quality evaluation of the selected studies.

4.5 Data Analysis
Latin squares have an analysis procedure very similar to that of factorial experiments (multiple factors). In factorial experiments, we can consider the blocking variable as a factor. For the Latin square, we have two blocking variables and one factor of interest. There is also another complication: the experiment design has multiple replications. This leads to a replicated Latin square design with equal columns (processes) and different rows (participants).

A possible statistical test for the analysis is ANOVA, as proposed in [19]. However, because it is a parametric test, some preconditions should be evaluated, and if one of them is violated an equivalent non-parametric test can be used.

4.6 Threats to Validity
Some possible threats to the validity of the experiment are already being assessed.

4.6.1 Internal Validity
The training can produce a learning effect in relation to DSArch, which may influence the evaluation of the primary studies when participants are not using DSArch. However, if there is such an influence, it will be in favor of the null hypothesis.

4.6.2 External Validity
The results may not be generalizable to all researchers who perform SLRs, because we do not sample from the population of SLR researchers. Still, we intend to obtain results that provide evidence of the effectiveness of DSArch.

5. CONTRIBUTIONS, FUTURE WORK AND ADVICE
The use of DSArch might reduce the effort required to complete an SLR, which is one of the major problems encountered in conducting this type of research. Another problem to be addressed by DSArch is the subjective, and therefore biased, selection of primary studies, which automating part of the process should mitigate. Another gain from automating part of the process is a decrease in the number of conflicts generated within the pairs during the evaluation of studies. We also expect that the proposed algorithm can be used to assist the SLR data analysis process, since the question answering principle can also be applied at that stage; however, this analysis is outside the scope of this thesis.

So far, an ad-hoc literature review on the proposed theme has been conducted, as well as a first design of the controlled experiment that will be used to evaluate DSArch. The planned next steps are:

• Execute a systematic literature review of techniques for assessing the quality of primary studies, aiming at choosing a technique to be semi-automated;
• Select and implement an ontology to represent the prior knowledge on the SLR;
• Develop the DSArch;
• Refine the plan and execute the controlled experiment to assess the DSArch;
• Analyze the obtained results;
• Write the thesis.

The main points on which advice is needed are:

• Is the proposed architecture consistent with the problem found?
• Can the experimental design evaluate the proposed architecture?
• Can the statistical test, ANOVA, evaluate the data generated by the experiment?

6. REFERENCES
[1] Kitchenham B. A. and Charters S. Guidelines for performing systematic literature reviews in software engineering. Technical Report, 2007.
[2] Brereton P., Kitchenham B. A., Budgen D., Turner M., and Khalil M. Lessons from applying the systematic literature review process within the software engineering domain. Journal of Systems and Software, vol. 80 (4), pp. 571-583, 2007. DOI: 10.1016/j.jss.2006.07.009.
[3] Zhou Y., Zhang H., Huang X., Yang S., Babar M. A., and Tang H. Quality assessment of systematic reviews in software engineering: a tertiary study. In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering (EASE '15). ACM, New York, NY, USA, Article 14, 14 pages, 2015.
[4] Carver J. C., Hassler E., Hernandes E., and Kraft N. A. Identifying barriers to the systematic literature review process. In Empirical Software Engineering and Measurement, 2013 ACM/IEEE International Symposium on, pp. 203-212. IEEE, 2013.
[5] Cohen A. M., Hersh W. R., Peterson K., and Yen P. Y. Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association: JAMIA, 13(2):206–219, 2006. ISSN 1067-5027.
[6] Tomassetti F., Rizzo G., Vetro A., Ardito L., Torchiano M., and Morisio M. Linked Data approach for selection process automation in Systematic Reviews. In Proceedings of the 15th Annual Conference on EASE, pp. 31-35, 2011.
[7] Kitchenham B., Brereton P., Turner M., Niazi M., Linkman S., Pretorius R., and Budgen D. Refining the systematic literature review process – two observer participant case studies. Empirical Software Engineering 15 (6), pp. 619–653, 2010.
[8] Kitchenham B., Sjøberg D. I., Brereton O. P., Budgen D., Dybå T., Höst M., and Runeson P. Trends in the quality of human-intensive software engineering experiments: a quasi-experiment. IEEE, 2013.
[9] Dybå T. and Dingsøyr T. Strength of evidence in systematic reviews in software engineering. In Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 178-187, 2008.
[10] P. H. R. U. in Oxford. Critical appraisal skills programme. http://www.casp-uk.net/, 2013.
[11] Kitchenham B., Sjøberg D. I., Brereton O. P., Budgen D., Dybå T., Höst M., and Runeson P. Can we evaluate the quality of software engineering experiments? In Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement. ACM, 2010.
[12] Dieste O., Grimán A., Juristo N., and Saxena H. Quantitative Determination of the Relationship between Internal Validity and Bias in Software Engineering: Consequences for Systematic Literature Reviews. Proc. Int'l Symp. Empirical Software Eng. and Metrics, pp. 285-288, 2011.
[13] Biolchini J., Mian P., Natali A., Conte T., and Travassos G. Scientific research ontology to support systematic review in software engineering. Advanced Engineering Informatics, vol. 21 (2), pp. 133-151, 2007. DOI: 10.1016/j.aei.2006.11.006.
[14] Knublauch H. Protégé-OWL. 2003. Available at http://protege.stanford.edu. Accessed on 07/03/2015.
[15] Labs H. Jena: A free and open source Java framework for building Semantic Web and Linked Data applications. 2010. Available at http://jena.apache.org. Accessed on 07/03/2015.
[16] Cunningham H., Maynard D., Bontcheva K., and Tablan V. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proc. of the 40th Anniversary Meeting of the ACL, 2002.
[17] Witte R., Li Q., Zhang Y., and Rilling J. Ontological Text Mining of Software Documents. 12th International Conference on Applications of Natural Language to Information Systems (NLDB 2007), Paris, France, June 27-29, 2007.
[18] Bo W. and Yunqing L. Research on the Design of the Ontology-Based Automatic Question Answering System. Computer Science and Software Engineering, 2008 International Conference on, vol. 5, pp. 871-874, 12-14 Dec. 2008.
[19] Juristo N. and Moreno A. M. Basics of Software Engineering Experimentation. Springer Publishing Company, Incorporated, 2010.
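
As an illustration of the search query creation process and the answer lookup described in Section 3.1, the following is a minimal Python sketch, not the DSTool implementation. The stop-word list, the toy ontology dictionary, and all function names are assumptions introduced here for illustration. A real ontology would be built from the SLR protocol, and simple stop-word removal keeps a few more words (for example "clear" and "aims") than the term set T given in the text.

```python
import re

# Assumed minimal stop-word list; a real tool would use a fuller one.
STOP_WORDS = {"is", "there", "a", "of", "the"}

# Toy stand-in for the SLR ontology (assumption): term -> related terms.
TOY_ONTOLOGY = {
    "goals": {"aim", "objective"},
    "research": {"study", "investigation"},
}


def segment(question):
    """Step (1): break a criterion into lowercase word tokens."""
    return re.findall(r"[a-z]+", question.lower())


def extract_keywords(tokens):
    """Step (2): keep every token that is not a stop word."""
    return {token for token in tokens if token not in STOP_WORDS}


def expand(keywords, ontology):
    """Step (3): add the terms that the ontology relates to each keyword."""
    expanded = set(keywords)
    for keyword in keywords:
        expanded |= ontology.get(keyword, set())
    return expanded


def find_evidence(text, terms):
    """Steps (4)-(5): list (offset, sentence, matched terms) for each
    sentence of the study text that mentions at least one query term."""
    hits = []
    for match in re.finditer(r"\S[^.!?]*[.!?]?", text):
        sentence = match.group().strip()
        matched = terms & set(segment(sentence))
        if matched:
            hits.append((match.start(), sentence, sorted(matched)))
    return hits


if __name__ == "__main__":
    criterion = ("Is there a clear statement (definition) of the aims "
                 "(goals, purposes, problems, motivations, objectives, "
                 "questions) of the research?")
    query = expand(extract_keywords(segment(criterion)), TOY_ONTOLOGY)
    study = ("This study proposes a tool. Results were good. "
             "The objective is to reduce effort.")
    for offset, sentence, matched in find_evidence(study, query):
        print(offset, sentence, matched)
```

In this sketch the sentence containing "objective" is found only because of the ontology expansion, since the criterion itself contains the plural "objectives". The returned offsets and sentences correspond to the points of interest that would be shown to the research conductor, who keeps the final decision.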
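
The Consensus variable of Section 4.1 is measured with Cohen's Kappa, which is defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between the two raters and p_e is the agreement expected by chance. The sketch below computes it for the include/exclude decisions of one pair. The function name and the sample labels are illustrative assumptions, not data from the experiment.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one and the same label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical decisions of the two members of one pair on four studies.
    first = ["include", "include", "exclude", "exclude"]
    second = ["include", "exclude", "exclude", "exclude"]
    print(cohens_kappa(first, second))
```

A value of 1 indicates perfect agreement within the pair, and a value near 0 indicates agreement no better than chance.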
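
The crossover design of Table 2 (Section 4.4) can be sketched as a random allocation of participants to pairs, with the pairs alternating between the two rows of the 2x2 Latin square. This is only a sketch under stated assumptions: the function name, the participant identifiers, and the requirement that the number of participants be a multiple of four (so that both rows are used equally often across replications) are introduced here and are not part of the experiment plan.

```python
import random

# Rows of the 2x2 Latin square from Table 2: treatment on (EU1, EU2).
LATIN_SQUARE_ROWS = [("A", "B"), ("B", "A")]


def build_design(participants, seed=None):
    """Randomly form pairs and assign each pair a treatment order."""
    people = list(participants)
    if not people or len(people) % 4 != 0:
        raise ValueError("need a multiple of four participants (assumed)")
    rng = random.Random(seed)
    rng.shuffle(people)  # random pairing of participants
    pairs = [tuple(people[i:i + 2]) for i in range(0, len(people), 2)]
    design = []
    for index, pair in enumerate(pairs):
        eu1, eu2 = LATIN_SQUARE_ROWS[index % 2]
        design.append({"pair": pair, "EU1": eu1, "EU2": eu2})
    return design


if __name__ == "__main__":
    # Hypothetical participant identifiers.
    for row in build_design([f"S{i}" for i in range(1, 9)], seed=1):
        print(row)
```

Each pair applies both treatments, one per experimental unit, so every additional block of two pairs is one more replication of the square.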