Decision Support Architecture for Primary Studies Evaluation

Vilmar Nepomuceno
Informatics Center (CIn)
Federal University of Pernambuco
Recife - PE, Brazil
vsn@cin.ufpe.br

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
Background. A systematic literature review is a process in which all relevant available research about a research question is identified, evaluated, and interpreted through individual studies. The workload required for this process may bias the evaluation of the studies, affecting the result. Aim. To create a decision support architecture that assists participants of a systematic review in the selection of the individual studies and in the quality assessment of these studies, possibly improving the execution time and reducing the evaluation bias. Method. We will improve the primary studies selection and quality assessment processes by using text mining techniques and ontologies to construct a decision support architecture. We will also conduct experiments to evaluate the proposed architecture. Contribution. Improved primary studies selection and quality assessment processes, with a reduced workload and a lower evaluation bias in systematic literature reviews.

Keywords
Systematic Literature Review, Text Mining, Ontology.

1. INTRODUCTION
The systematic literature review (SLR) is a process in which all relevant available research about a research question, area, or phenomenon of interest is identified, evaluated, and interpreted through individual studies, which are called primary studies during the SLR process. A guide for conducting SLRs in software engineering was proposed by Kitchenham and Charters [1]; it summarizes the systematic review process into three main phases: planning the review, conducting the review, and reporting the review results.
The SLR planning phase generates a protocol that defines how to conduct the review. Once started, the process of conducting the review needs to define a search strategy, which is responsible for finding the available primary studies. Once the candidate studies have been obtained, it is necessary to select among them using the criteria defined in the SLR protocol. The criteria for study inclusion and exclusion must take into account the research question already defined.
The selection of primary studies relies on a free interpretation of the criteria by whoever is leading the SLR. In addition, the large number of papers retrieved during the search process and the poor quality of their abstracts [2] make the selection process a hard and sometimes inaccurate task. After the selection process, the quality of the selected papers is evaluated to increase the reliability and importance of the SLR results. There are several guidelines for performing this task, but they are usually not properly followed, and their use is, in general, not justified by the authors [3]. These two processes of primary studies evaluation, study selection and quality assessment, are hard and time-consuming tasks [4], and much of this time is due to the primary study reading procedure.
Some studies address the automatic selection of primary studies using text mining techniques, in which the primary studies are classified by the similarity of their texts [5] [6]. However, in these studies the research conductor does not take part in the process, which can generate a set of papers that is significantly different from the one that would be generated by the manual process. Besides, according to Kitchenham et al. [7], the process of quality assessment for primary studies is essential for SLRs. Our research seeks (i) to identify a way to semi-automatically support the quality assessment of primary studies, using text mining techniques and ontologies to describe prior knowledge about the SLR, and (ii) to identify, semi-automatically and also through text mining techniques, the inclusion/exclusion criteria inside the text of the studies returned by the search, supporting the selection process of the primary studies.
The semi-automatic quality evaluation and the semi-automatic search for the inclusion/exclusion criteria should support the evaluation of primary studies in the SLR process. With this, we aim to reduce the part of the SLR execution time that is due to the primary study reading procedure, and to decrease the evaluation bias that may arise from the subjectivity of the evaluation process, increasing the assurance that the outcome of the SLR is not compromised [8]. In addition, the search space of studies can be increased, which today is limited because of the effort spent during the selection process and the subsequent quality evaluation of these studies.

2. RELATED WORKS
2.1 Quality Assessment
There are several guidelines available for quality assessment of primary studies, such as [9], which puts forward eleven evaluation criteria based on CASP [10]. Kitchenham et al. [11] used a checklist for quality assessment in order to specify an appropriate process for evaluating quality. The study concluded that at least two evaluators are required to improve reliability and that the quality assessment should be represented by the sum of the criteria used. However, Dieste et al. [12] found that instruments for quality control in SLRs should not contain a plethora of items, and that this assessment should be careful about the limits of this process with respect to aspects of internal validity [11].
There is still no standard process for primary studies quality assessment in software engineering; several authors have suggested different ways to estimate the validity of the studies. Probably due to this lack of standardization, our research found no tools that automate this process, or part of it.

2.2 Primary Studies Selection
The selection of primary studies is the most time-consuming task for an SLR, which can be affected by the titles and abstracts that
do not reflect well the content of the work [4]. Additionally, time constraints may lead the research conductors to reduce the search space. Hence, automating the process, or part of it, can help overcome these barriers.
Automatic classification of primary studies, indicating their inclusion or exclusion, can be found in some works [5] [6]; in other words, in these works the study selection process is done in an automated way without the intervention of the researcher. This approach can dramatically reduce the effort spent on that task. However, methods that do this kind of selection may include studies that do not help the research, because of their low quality or the low accuracy of the algorithm, and they may exclude studies with low textual similarity but good quality, which could be used by the researcher in some way.
Therefore, this research proposes a semi-automated approach, leaving to the researcher the final decision on whether or not to include these studies. When comparing our methodology with the automatic classifiers, it is likely that the technique proposed in this work will take more time to perform the task. However, it is possible that the final result of the selection process using our proposal is more satisfactory, and that the process is significantly faster than the manual approach.

3. PROPOSAL
According to the aim of this study, our guiding research question (RQ) is “How can we improve the process of evaluating primary studies by automating parts of the process?”. This question can be decomposed into:
• RQ1. How can we improve the selection process by automating parts of the process?
• RQ2. How can we improve the quality assessment process by automating parts of the process?
An initial version of the proposed Decision Support Architecture (DS Architecture) is presented in Figure 1.

Figure 1. Proposed DS Architecture.

The knowledge about the SLR will be represented by an ontology (1) based on the protocol designed for the execution of the SLR and on possible refinements by the research conductor. This approach is proposed by Biolchini et al. [13], who state that it is possible to improve the results obtained during the SLR through standardization of the terminology for the concepts involved using an ontology. It is still undefined how the ontology will be created, but there are freely available tools such as Protégé [14] and Jena [15], and the GATE (General Architecture for Text Engineering) framework [16], which was already used in [17] for this purpose.
The selection and quality criteria should be used as input to a text mining algorithm, which should be available as a tool component (DSTool) and will be able to identify whether the selection criteria have been met and which quality criteria can be identified inside the primary study text. The points of interest in the text, on which the tool bases its response, will be shown to the research conductor, who will decide whether the criteria were actually met. The inclusion/exclusion criteria will be provided by the conductor during protocol creation, and the quality criteria will come from one of the guidelines available in the literature, after a systematic literature review has been performed to indicate the best guideline to be implemented by the text mining algorithm. The main idea of providing a guideline in advance is to improve the robustness and accuracy of the algorithm, which does not prevent other criteria, not present in the selected guideline, from being used. Another input to the text mining algorithm is the ontology built upon the protocol (3). Many text mining techniques use ontologies as a knowledge base, one of which is the ontology-based question answering system [18], which is the starting point for the tool development. Ways to present the results from the tool (4) are still to be evaluated.

3.1 Decision Support Algorithm
Based on the system architecture proposed by Bo and Yunqing [18], a question answering architecture is being proposed for the decision support algorithm (Figure 2).

Figure 2. Proposed Algorithm Architecture.

The questions that will serve as input to the algorithm will be provided by the protocol: the inclusion/exclusion criteria and the quality criteria will be provided one by one to the segmentation process (1). At this point, the question will be broken into terms, and the keywords will be extracted (2). Take, for example, the following quality criterion: “Is there a clear statement (definition) of the aims (goals, purposes, problems, motivations, objectives, questions) of the research?” [3], from which we can draw the following set of terms T = {statement, definition, goals, purposes, problems, motivations, objectives, questions, research}. The keywords found in the criteria will be expanded with the help of the ontology (3). Thus, we can create a larger set


of terms that should improve the algorithm accuracy. These three steps can be called the search query creation process.
After creating the search query, we look for the answers in the primary studies (4). At this point we use text mining algorithms to find possible answers to the criteria and return the points in the text where the answers can be found, as well as the text of the response (5).

4. EVALUATION PROPOSAL
To evaluate the effectiveness of the Decision Support Architecture (DSArch), a controlled experiment [19] will be performed. The study aims to answer the following research questions:
• RQ1. Does DSArch decrease the selection time of primary studies?
• RQ2. Does DSArch increase consensus between individuals of the same pair when including/excluding studies?
• RQ3. Does DSArch decrease the quality assessment time of the selected primary studies?

4.1 Hypotheses, Variables, and Parameters
The null hypotheses are presented below:
• H0,RQ1. There is no difference in the execution time of the selection process with or without the use of DSArch.
• H0,RQ2. There is no difference in the consensus among peers with or without the use of DSArch.
• H0,RQ3. There is no difference in the quality assessment time of the selected primary studies with or without the use of DSArch.
To examine the hypotheses, the following dependent variables will be used:
• Selection Time. The time to complete the process of selecting primary studies.
• Consensus. Measured by an inter-rater agreement coefficient. Cohen's Kappa will be used because the decisions are made by pairs.
• Quality Time. The quality assessment time of the selected primary studies.
The experimental factor is the process for evaluating the primary studies, which comprises selecting the studies and assessing the quality of the selected ones. There are two factor alternatives (treatments): executing the inclusion/exclusion process and the quality assessment using DSArch, and executing them manually using only the protocol criteria, without tool support.

4.2 Material, Tool, and Training
To perform the experiment, a subset of primary studies drawn from an SLR will be provided to participants. The protocol created for the SLR will be provided, and both the search process and the visualization of the primary studies will be conducted with REviewER¹. It belongs to our research group and is a tool that supports the search process in several databases (ACM Digital Library, Engineering Village, IEEE Xplore, Science Direct, Scopus, and Springer Link) in an automated way and supports the primary studies selection process. A version of REviewER containing the DS Architecture will be developed to be used in the evaluation. However, only the DS Architecture should be assessed.
A training session will be conducted with the participants to present the REviewER tool and to explain how DSArch should be analyzed. For this training, some primary studies, chosen from a subset of the set of primary studies obtained from the SLR search process, will be selected, and participants should evaluate them using the factor alternatives.

4.3 Task and Data
To measure the dependent variables, two tasks will be performed, neither of which requires prior knowledge of SLRs from the subjects; with this, we aim to facilitate the recruitment of participants. The tasks to be performed in the experiment are the selection of primary studies from a subset of the set of primary studies obtained from the SLR search process, and the quality assessment of the selected studies. No other SLR procedures will be performed.
At the end of the execution, the participants must provide a list of accepted primary studies, the time taken to complete the selection process, and the time taken to complete the quality assessment process. A questionnaire will be administered after the experiment to evaluate the experiment itself and what the participants thought about the proposed architecture.

4.4 Execution
The experiment will take place in a lab with all participants present at the same time. Participants will be divided into pairs following the approach proposed by Kitchenham et al. [7]; the pairs will be formed at random.
Table 2 shows the proposed design of the experiment:

Table 2. Experimental Design
Pairs/EU    EU1    EU2
P1          A      B
P2          B      A

Here P1 and P2 are two pairs of participants, and EU1 and EU2 are the experimental units, each of which is one subset of the primary studies obtained from the SLR proposed in the experiment. The subsets will be chosen at random, as will A and B, which are the applied treatments. This design is being proposed to facilitate internal replication of the experiment.
To collect the data, the participants should submit the list of selected primary studies, the time taken to complete the selection process, and the time taken to complete the quality evaluation of the selected studies.

4.5 Data Analysis
Latin squares have an analysis procedure very similar to that of factorial experiments (multiple factors). In the case of factorial experiments, we can consider the blocking variable as a factor. For the Latin square, we have two blocking variables and a factor of interest. There is also another complicating factor: the experiment design has multiple replications. This leads to a replicated Latin square design with equal columns (processes) and different rows (participants).
A possible statistical test for the analysis is ANOVA, as proposed in [19]. However, because it is a parametric test, some preconditions should be evaluated, and if one of them is violated an equivalent non-parametric test can be used.
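
Before any such test is applied, the Consensus variable of Section 4.1 has to be computed as one value per pair. A minimal sketch of how Cohen's Kappa could be obtained from the include/exclude decisions of the two members of a pair is given below. The function name, the label strings, and the example decisions are illustrative assumptions; they are not part of DSArch or REviewER, and no data from the planned experiment is shown.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters who labelled the same list of studies."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of studies")
    n = len(rater_a)
    # Observed agreement: fraction of studies with identical decisions.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; Kappa is formally
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical decisions of one pair over six candidate studies.
reviewer_1 = ["include", "include", "exclude", "exclude", "include", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "include", "include"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

In this hypothetical case the two reviewers agree on four of six studies while chance agreement is 0.5, so Kappa is about 0.33. One such value per pair and treatment would then be the input to the test for H0,RQ2.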

¹ http://sites.google.com/site/eseportal/tools/reviewer


4.6 Threats to Validity
Some possible threats to the validity of the experiment are already being assessed.

4.6.1 Internal Validity
The training can produce a learning effect in relation to DSArch, which may influence the evaluation of the primary studies when participants are not using DSArch. However, if there is such an influence, it will be in favor of the null hypothesis.

4.6.2 External Validity
The results may not be generalizable to all researchers who perform SLRs, because we do not sample from the population of SLR researchers. Nevertheless, we intend to obtain results that provide evidence of the effectiveness of DSArch.

5. CONTRIBUTIONS, FUTURE WORK AND ADVICE
The use of DSArch might reduce the effort to complete an SLR, which is one of the major problems encountered in conducting this type of research. Another problem to be addressed by DSArch is the subjective, and therefore biased, selection of primary studies, which is mitigated by automating part of the process. Another gain from automating part of the process is a decrease in the number of conflicts generated within the pairs during the evaluation of studies. An expectation for the proposed algorithm is that it can be used to assist in the SLR data analysis process, since the question answering principle can also be used at this stage; however, this analysis is outside the scope of this thesis.
So far, an ad hoc literature review on the proposed theme has been conducted, as well as a first design of the controlled experiment that will be conducted to evaluate DSArch. The planned next steps are:
• Execute a systematic literature review of techniques for assessing the quality of primary studies, aiming at choosing a technique to be semi-automated;
• Select and implement an ontology to represent the prior knowledge on the SLR;
• Develop the DSArch;
• Refine the plan and execute the controlled experiment to assess the DSArch;
• Analyze the obtained results;
• Write the thesis.
The main points where advice is needed:
• Is the proposed architecture consistent with the problem found?
• Can the experimental design evaluate the proposed architecture?
• Can the statistical test, ANOVA, evaluate the data generated by the experiment?

6. REFERENCES
[1] Kitchenham B. A. and Charters S. Guidelines for performing systematic literature reviews in software engineering. Technical Report, 2007.
[2] Brereton P., Kitchenham B. A., Budgen D., Turner M., and Khalil M. Lessons from applying the systematic literature review process within the software engineering domain. Journal of Systems and Software, vol. 80 (4), pp. 571-583, 2007. DOI: 10.1016/j.jss.2006.07.009.
[3] Zhou Y., Zhang H., Huang X., Yang S., Babar M. A., and Tang H. 2015. Quality assessment of systematic reviews in software engineering: a tertiary study. In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering (EASE '15). ACM, New York, NY, USA, Article 14, 14 pages.
[4] Carver J. C., Hassler E., Hernandes E., and Kraft N. A. Identifying barriers to the systematic literature review process. In Empirical Software Engineering and Measurement, 2013 ACM/IEEE International Symposium on, pages 203-212. IEEE, 2013.
[5] Cohen A. M., Hersh W. R., Peterson K., and Yen P. Y. Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association: JAMIA, 13(2):206–219, 2006. ISSN 1067-5027.
[6] Tomassetti F., Rizzo G., Vetro A., Ardito L., Torchiano M., and Morisio M. (2011). Linked Data approach for selection process automation in Systematic Reviews. In Proceedings of the 15th Annual Conference on EASE, pp. 31-35.
[7] Kitchenham B., Brereton P., Turner M., Niazi M., Linkman S., Pretorius R., and Budgen D. Refining the systematic literature review process – two observer participant case studies. Empirical Software Engineering 15 (6) (2010) 619–653.
[8] Kitchenham B., Sjøberg D. I., Brereton O. P., Budgen D., Dybå T., Höst M., and Runeson P. Trends in the quality of human-intensive software engineering experiments: a quasi-experiment. IEEE, 2013.
[9] Dybå T. and Dingsøyr T. Strength of evidence in systematic reviews in software engineering. Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, 2008, pp. 178-187.
[10] P. H. R. U. in Oxford. Critical appraisal skills programme. http://www.casp-uk.net/, 2013.
[11] Kitchenham B., Sjøberg D. I., Brereton O. P., Budgen D., Dybå T., Höst M., and Runeson P. Can we evaluate the quality of software engineering experiments? In Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement. ACM, 2010.
[12] Dieste O., Grimán A., Juristo N., and Saxena H. “Quantitative Determination of the Relationship between Internal Validity and Bias in Software Engineering: Consequences for Systematic Literature Reviews,” Proc. Int’l Symp. Empirical Software Eng. and Measurement, pp. 285-288, 2011.
[13] Biolchini J., Mian P., Natali A., Conte T., and Travassos G. Scientific research ontology to support systematic review in software engineering. Advanced Engineering Informatics, vol. 21 (2), pp. 133-151, 2007. DOI: 10.1016/j.aei.2006.11.006.
[14] Knublauch, H. Protégé-OWL. 2003. Available at http://protege.stanford.edu. Accessed on 07/03/2015.
[15] Labs, H. Jena: A free and open source Java framework for building Semantic Web and Linked Data applications. 2010. Available at http://jena.apache.org. Accessed on 07/03/2015.


[16] Cunningham H., Maynard D., Bontcheva K., and Tablan V. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proc. of the 40th Anniversary Meeting of the ACL, 2002.
[17] Witte R., Li Q., Zhang Y., and Rilling J. Ontological Text Mining of Software Documents. 12th International Conference on Applications of Natural Language to Information Systems (NLDB 2007), Paris, France, June 27-29, 2007.
[18] Bo W. and Yunqing L. "Research on the Design of the Ontology-Based Automatic Question Answering System," Computer Science and Software Engineering, 2008 International Conference on, vol. 5, pp. 871-874, 12-14 Dec. 2008.
[19] Juristo N. and Moreno A. M. Basics of Software Engineering Experimentation. Springer Publishing Company, Incorporated, 2010.

