=Paper=
{{Paper
|id=None
|storemode=property
|title=Towards Automatic Grading of Evidence
|pdfUrl=https://ceur-ws.org/Vol-744/paper7.pdf
|volume=Vol-744
}}
==Towards Automatic Grading of Evidence==
Abeed Sarker¹, Diego Mollá-Aliod¹, and Cécile Paris²
¹ Centre for Language Technology, Department of Computing, Macquarie University, Sydney, NSW 2109, Australia
{abeed.sarker,diego.molla-aliod}@mq.edu.au, URL: http://www.clt.mq.edu.au
² CSIRO – ICT Centre, Locked Bag 17, North Ryde, Sydney, NSW 1670, Australia
cecile.paris@csiro.au, URL: http://www.csiro.au
Abstract. The practice of Evidence Based Medicine requires practitioners to extract evidence from published medical literature and grade the extracted evidence in terms of quality. With the goal of automating the time-consuming grading process, we assess the effects of a number of factors on the grading of the evidence. The factors include the publication types of individual articles, publication years, journal information and article titles. We model the evidence grading problem as a supervised classification problem and show, using several machine learning algorithms, that the use of publication types alone as features gives an accuracy close to 70%. We also show that the other factors do not have any notable effects on the evidence grades.
1 Introduction
An important step for physicians who practise Evidence Based Medicine (EBM) is the grading of the quality of the clinical evidence present in the medical literature. Evidence grading is a manual process, and the time required to perform it adds to the already time-consuming nature of EBM practice [6, 5]. The aim of our work is to identify the extent to which evidence grades can be automatically determined from specific information about each publication, such as the publication type, year of publication, journal name and title. In the following sections, we present a brief overview of EBM and evidence grading, followed by a discussion of our approach, results and planned future work towards building an automatic evidence grading system.
2 Evidence Based Medicine and Evidence Grading
EBM is the ‘conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients’ [16]. Current clinical guidelines urge physicians to practise EBM when providing care for their
patients. Good practice of EBM requires practitioners to search for the best
quality evidence, synthesise collected information and grade the quality of the
evidence.
2.1 The Strength of Recommendation Taxonomy
There are over 100 grading scales in use today for specifying grades of evidence. The Strength of Recommendation Taxonomy (SORT) [4] is one such grading scale. It is a simple, straightforward and comprehensive grading system that can be applied throughout the medical literature. Consequently, it is used by various family medicine and primary care journals, such as the Journal of Family Practice (JFP)³. SORT uses three ratings — A (strong), B (moderate) and C (weak) — to specify the Strength of Recommendation (SOR) of a body of evidence. Due to the popularity of this grading system, we use it as our target grading scheme.
2.2 Factors Influencing Evidence Grades
A number of factors influence the final grade assigned to evidence obtained from one or more published studies. According to Ebell et al. [4], these factors include: the quality of evidence of the individual studies, the types of evidence presented in the studies (i.e., patient-oriented vs disease-oriented⁴) and the consistency of the outcomes presented. In SORT, grade A reflects a recommendation based on consistent and good-quality, patient-oriented evidence; grade B reflects a recommendation based on inconsistent or limited-quality patient-oriented evidence; and grade C reflects a recommendation based on consensus, usual practice, opinion or disease-oriented evidence.
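To make the taxonomy concrete, a minimal sketch follows; it simply encodes the three grade descriptions above as a rule over simplified boolean inputs (a hypothetical helper, not part of SORT or of our system).
<pre>
def sort_grade(patient_oriented: bool, good_quality: bool, consistent: bool) -> str:
    """Toy encoding of the SORT grade descriptions summarised above."""
    if not patient_oriented:
        # Consensus, usual practice, opinion or disease-oriented evidence.
        return "C"
    if good_quality and consistent:
        # Consistent, good-quality, patient-oriented evidence.
        return "A"
    # Inconsistent or limited-quality patient-oriented evidence.
    return "B"
</pre>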
3 Related Work
To the best of our knowledge, there is no existing work addressing automatic evidence grading directly, although there is work on related topics. Related research has focused mostly on automatic quality assessment of medical publications for purposes such as retrieval and post-retrieval re-ranking, where approaches based on word co-occurrences [7] and bibliometrics [14] have been proposed for improving the retrieval of medical documents. Tang et al. [18] propose a post-retrieval re-ranking approach that attempts to re-rank results returned by a search engine, which may or may not be published research work. However, their approach is only tested in a specific sub-domain (i.e., Depression) of the medical domain. Kilicoglu et al. [9] focus on identifying high-quality medical articles and build on
³ http://www.jfponline.com
⁴ Patient-oriented evidence measures outcomes that matter to patients: morbidity, mortality, symptom improvement, cost reduction and quality of life; disease-oriented evidence measures intermediate, physiologic, or surrogate end points that may or may not reflect improvements in patient outcomes, e.g., blood pressure.
the work by Aphinyanaphongs et al. [1]. They use machine learning and obtain 73.7% precision and 61.5% recall. These approaches rely heavily on meta-data associated with the articles, making them dependent on the database from which the articles are retrieved. Hence, these approaches would not work on publications that do not have associated meta-data.
The definitions of ‘good-quality evidence’ [4] suggest that the publication types of medical articles are good indicators of their quality. Literature in the medical domain consists of a large number of publication types of varying quality⁵. For example, a randomised controlled trial is of much higher quality than a case study of a single patient; evidence obtained from the former is thus more reliable. Greenhalgh [8] mentions some other factors that influence the grade of evidence, such as the number of subjects included in a study and the mechanism by which subjects are allocated (e.g., randomisation/no randomisation), but the latter is generally specified by the publication type (e.g., randomised controlled trial) of the article. Recently, Sarker and Mollá [17] emphasised the importance of publication types for SOR determination and showed that automatic identification of high-quality publication types (e.g., Systematic Review and Randomised Controlled Trial) is relatively simple. Lin and Demner-Fushman [3] also acknowledged the importance of publication types in determining the quality of clinical evidence. They use a working definition of the ‘strength of evidence’ as a sum of the scores given to the journal types, publication types and publication years of individual publications. Their scores are used for citation ranking, not evidence grading, and therefore their results cannot be compared to ours. However, their research does suggest that journal names and publication years have an influence on the quality of individual publications, which in turn may influence the grade of evidence obtained from them.
4 Methods
We used the corpus⁶ proposed by Mollá [11] to collect our data. Each record in the corpus is a clinical query obtained from the ‘Clinical Inquiries’ section of JFP. Each query is accompanied by one or more evidence based answers, and each answer is generated from one or more medical publications. Furthermore, each answer contains its SOR, a list of publication references and a brief description of the publications, including their publication types. From the corpus, we collected all evidence based answers that had their SORs specified. Our final set consists of 1132 evidence based answers generated from 2713 medical documents. Of the 1132 answers, 330 are of grade A, 511 of grade B and 291 of grade C. Since it was not possible to accommodate every publication type separately, we grouped together publication types with low frequency and similar quality levels. Our final set consisted of 11 groups of known publication types, each having a different quality level, and 1 group of unknown types, as shown in Figure 1. Based on our collected data, we considered 45.1% — the accuracy when all instances are classified as B, the majority class (511 of the 1132 answers) — as the baseline for our experiments.
⁵ A list of publication types used by the US National Library of Medicine can be found at http://www.nlm.nih.gov/mesh/pubtypes2006.html. This list is not exhaustive.
⁶ The corpus is available to the research community. The authors of this paper can be contacted for details.
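As a toy illustration of the grouping and the majority-class baseline described above, the following Python sketch may help; the raw-to-group mapping is hypothetical and only hints at the kind of merging performed, while the SOR counts are those reported in the text.
<pre>
from collections import Counter

# Hypothetical mapping from raw publication types to groups; the real corpus
# contains many more types, and unseen types fall into the 'Unknown' bucket.
RAW_TO_GROUP = {
    "Randomised Controlled Trial": "RCT",
    "Observational Study": "Other Study",          # low-frequency types are merged
    "Phase II Clinical Trial": "Other Clinical Trial",
}

def group_publication_type(raw_type: str) -> str:
    return RAW_TO_GROUP.get(raw_type, "Unknown")

# Majority-class baseline: always predict the most frequent SOR grade (B).
sor_counts = Counter({"A": 330, "B": 511, "C": 291})    # counts reported above
baseline = max(sor_counts.values()) / sum(sor_counts.values())
print(f"Majority-class baseline: {baseline:.1%}")       # -> 45.1%
</pre>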
4.1 Distribution of Publication Types over SORs
In an initial analysis, we studied the distribution of publication types over the SOR grades (Figure 1). In the figure, ‘Other Study’ refers to low-frequency study types (e.g., Observational Study), ‘Other Clinical Trial’ refers to clinical trials other than ‘Randomised Controlled Trials’ (RCTs) and ‘Unknown’ refers to articles whose publication types are not known. A clear pattern in the distribution of publication types over SORs can be seen. For SOR A, evidence primarily comes from RCTs, Systematic Reviews and Meta-analyses, and the numbers drop significantly for other publication types. For SOR C, most of the evidence comes from publications presenting expert opinion, case series/reports and consensus guidelines. The distribution for SOR B has the largest spread, with Cohort studies having the highest frequency. These distributions suggest that publication types play an important role in determining the SOR.
Fig. 1. Distribution of publication types across SORs.
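The data behind Figure 1 can be reproduced with a simple cross-tabulation; the sketch below assumes a hypothetical long-format table with one row per source publication of an answer, not the actual corpus loading code.
<pre>
import pandas as pd

# Hypothetical records: one row per source publication of an evidence based answer.
records = pd.DataFrame({
    "sor":      ["A", "A", "B", "C", "B"],
    "pub_type": ["RCT", "Systematic Review", "Cohort Study",
                 "Expert Opinion", "RCT"],
})

# Counts of each publication type per SOR grade (the distribution plotted in Figure 1).
distribution = pd.crosstab(records["pub_type"], records["sor"])
print(distribution)
</pre>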
4.2 SOR Prediction from Publication Types
To test the extent to which SORs can be predicted from the publication types, we performed basic experimentation using machine learning. We modelled the grading of evidence as a classification problem, using only the publication types of the articles as features. Each instance in our model represents an evidence based answer and is composed of the SOR class and a vector containing the
counts of each of the 12 publication types shown in Figure 1. Based on the publication types associated with each answer, the classifiers attempt to predict the SOR (A, B or C).
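The instance representation can be sketched as follows; Figure 1 names most of the 12 groups, but the exact list below is partly illustrative (a couple of the group names are placeholders).
<pre>
from collections import Counter
from typing import List, Tuple

# Illustrative list of the 12 publication-type groups; the order fixes the
# positions in the feature vector. A couple of names are hypothetical placeholders.
PUB_TYPE_GROUPS = [
    "Systematic Review", "Meta-analysis", "RCT", "Other Clinical Trial",
    "Cohort Study", "Case Series/Report", "Expert Opinion",
    "Consensus Guideline", "Practice Guideline", "Narrative Review",
    "Other Study", "Unknown",
]

def make_instance(pub_types: List[str], sor: str) -> Tuple[List[int], str]:
    """One evidence based answer -> (counts over the 12 groups, SOR label)."""
    counts = Counter(pub_types)
    return [counts.get(g, 0) for g in PUB_TYPE_GROUPS], sor

features, label = make_instance(["RCT", "RCT", "Systematic Review"], "A")
</pre>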
We used two-thirds of our data for training and the remaining third as held-out test data. For both sets, we kept the proportions of instances belonging to the three classes the same as their proportions in the whole data set. We performed our experimentation using the software package Weka⁷. Weka provides implementations of a range of classifiers organised into generic groups, and in our preliminary analysis we experimented on our training data with multiple classifiers belonging to each generic group. We chose five classifiers that produced good results on our training data and have also been shown to produce good results on similar problems in the past. The five chosen classifiers were (Weka names shown in brackets): Bayes Net, SVMs (SMO), K-Nearest Neighbour (IBk), Multinomial Logistic Regression (Logistic) [10]⁸ and C4.5 Decision Tree (J48) [15]. For specific classifiers, we performed simple parameter tuning and chose the parameter values that produced the best results under stratified 10-fold cross-validation on the training set. For the Bayes Net classifier, we used the K2 search algorithm [2] for local score metrics and the simple estimator for estimating conditional probability tables. For SVMs, we used John Platt’s [13] sequential minimal optimisation algorithm and solved our multi-class problem using pairwise (1-vs-1) classification. We used an RBF kernel for the SVMs, normalised all attributes and used a grid search to find good values for the parameters γ and C. To find the best value of K for the K-Nearest Neighbour algorithm, we searched through all odd values of K from 1 to 101. For the C4.5 Decision Tree classifier, we searched between 2^-5 and 2^-1 to find the best value for the confidence factor parameter.
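The experiments themselves were run in Weka, but the overall procedure can be sketched equivalently in Python with scikit-learn; the snippet below uses placeholder data and assumed search ranges for γ, C and K, and is meant only to illustrate the stratified split, the 10-fold cross-validated grid search and the held-out evaluation described above.
<pre>
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Placeholder data: count vectors over the 12 publication-type groups (X)
# and SOR labels (y). The real features come from the corpus described above.
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(300, 12))
y = rng.choice(["A", "B", "C"], size=300, p=[0.29, 0.45, 0.26])

# Two-thirds training, one-third held out, stratified by SOR grade.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# RBF-kernel SVM with normalised attributes; grid search over gamma and C
# (the ranges here are assumptions, not the grid used in the paper).
svm = Pipeline([("scale", MinMaxScaler()), ("svc", SVC(kernel="rbf"))])
svm_grid = GridSearchCV(svm, {"svc__gamma": 2.0 ** np.arange(-5, 4),
                              "svc__C": 2.0 ** np.arange(-2, 9)}, cv=cv)

# K-Nearest Neighbour: search the odd values of K from 1 to 101.
knn_grid = GridSearchCV(KNeighborsClassifier(),
                        {"n_neighbors": list(range(1, 102, 2))}, cv=cv)

for name, model in [("SVM", svm_grid), ("kNN", knn_grid)]:
    model.fit(X_tr, y_tr)
    print(name, model.best_params_,
          f"held-out accuracy = {model.score(X_te, y_te):.3f}")
</pre>
scikit-learn's SVC resolves multi-class problems with pairwise (one-vs-one) classification internally, which matches the scheme used in our Weka experiments.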
Classifier             Accuracy (%)   95% CI      Parameters
Bayes Net              66.578         61.6-71.3   K2, SimpleEstimator
SVMs                   68.449         63.5-73.1   γ = 1.0, C = 2^7
K-Nearest Neighbour    68.717         63.8-73.4   K = 7
Logistic Regression    67.380         62.4-72.1   -
C4.5                   68.182         63.2-72.9   confidenceFactor = 2^-1
Table 1. Accuracies, 95% confidence intervals and specific parameter values for various classifiers, using only publication types as features.
⁷ http://www.cs.waikato.ac.nz/ml/weka/
⁸ The Weka implementation of this algorithm is slightly different from the original implementation. Details can be found at: http://www.java2s.com/Open-Source/Java-Document/Science/weka/weka/classifiers/functions/Logistic.java.htm
4.3 SOR Prediction from other Factors
In addition to publication types, we attempted to assess the influence of other factors such as journal information and publication year, following Lin and Demner-Fushman’s [3] work. We added the two feature sets — journal name and publication year — to our data, and performed further experimentation by adding the title of each article as a feature set. We suspected that titles may help to identify the qualities of individual publications, since they sometimes provide useful information about how the studies were carried out (e.g., ‘A Double-blind, Placebo-controlled Trial’). In our model, we represented the titles and journal names using uni- and bigrams. Prior to generating the n-grams, we processed the titles by removing stop words, stemming the remaining words using the Porter stemmer and removing words occurring fewer than five times across the whole data set. We repeated the experimental procedures mentioned above with various combinations of these feature sets.
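A rough equivalent of this title preprocessing and n-gram extraction, sketched with NLTK and scikit-learn rather than the tools actually used, looks as follows; the example titles are hypothetical and the frequency cut-off is lowered for the toy data.
<pre>
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("stopwords", quiet=True)
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(title: str) -> str:
    """Lower-case, drop stop words and stem the remaining tokens."""
    tokens = [stemmer.stem(t) for t in re.findall(r"[a-z]+", title.lower())
              if t not in stop_words]
    return " ".join(tokens)

titles = [                                              # hypothetical titles
    "A Double-blind, Placebo-controlled Trial of Drug X",
    "Systematic review of interventions for condition Y",
]

# Uni- and bigrams over the preprocessed titles; the paper removes words
# occurring fewer than five times (min_df=5), lowered here for the toy data.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1)
X_title = vectorizer.fit_transform(preprocess(t) for t in titles)
print(vectorizer.get_feature_names_out())
</pre>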
5 Results and Discussion
Using only publication types as a feature set, we obtained classification accuracies of approximately 66-69% (an improvement of more than 20 percentage points over the baseline) with the various classifiers on our held-out test set. Table 1 shows the accuracies obtained by the five above-mentioned classifiers, along with 95% confidence intervals⁹ for the accuracies and important parameter values for specific classifiers.
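The intervals were obtained with R's binom.test (see footnote 9); an equivalent exact (Clopper-Pearson) interval can be sketched in Python as below, where the held-out test size of 374 answers is an assumption inferred from the reported accuracies rather than a figure stated in the paper.
<pre>
from statsmodels.stats.proportion import proportion_confint

# Exact (Clopper-Pearson) 95% CI for a reported accuracy, analogous to R's binom.test.
n_test = 374                           # assumed held-out size (~1/3 of the 1132 answers)
n_correct = round(0.68717 * n_test)    # e.g. the K-Nearest Neighbour row of Table 1
low, high = proportion_confint(n_correct, n_test, alpha=0.05, method="beta")
print(f"{n_correct / n_test:.1%} accuracy, 95% CI [{low:.1%}, {high:.1%}]")
</pre>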
An analysis of the incorrect classifications revealed that there were few errors between A and C, which is exactly what we expected given their very different distributions of publication types. The most common errors were between SOR A and B, and SOR C classified as B. Our manual analysis revealed that errors were caused primarily by factors such as the sizes of studies, and the consistency and types of outcomes, which our classifiers did not take into account. For example, an essential condition for evidence to be of grade A or B is the presence of patient-oriented outcomes, irrespective of the type of study. At the same time, for certain types of publications, such as Cohort studies, the sizes of the studies significantly influence their quality. Unaware of this information, our classifiers classified all evidence obtained primarily from Cohort studies as grade B. Furthermore, evidence obtained primarily from Meta-analyses and Systematic Reviews was graded as A, irrespective of the consistency or types of outcomes presented in the studies.
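This error pattern is easy to inspect with a confusion matrix over the three grades; the sketch below uses made-up gold and predicted labels purely for illustration.
<pre>
from sklearn.metrics import confusion_matrix

# Hypothetical gold and predicted SOR grades for a handful of test answers;
# rows and columns are ordered A, B, C.
y_true = ["A", "A", "B", "B", "C", "C", "B", "A"]
y_pred = ["A", "B", "B", "A", "B", "C", "B", "A"]
print(confusion_matrix(y_true, y_pred, labels=["A", "B", "C"]))
</pre>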
Our experiments suggest that adding factors such as journal names, publication years and article titles to the publication types does not significantly influence the SORs. Table 2 shows the highest accuracies obtained using various combinations of feature sets, the 95% confidence intervals and the classifiers producing these results. From the table it is evident that the absence of publication types as a feature set causes significant drops in accuracy. Although incorporating article titles as a feature set produces marginally better accuracies than our baseline, our experiments show that no significant improvements are achieved when this feature set is combined with publication types. The other feature sets, alone or in combination with each other, do not give a statistically significant improvement over the baseline.
Features                                  Accuracy (%)   95% CI      Classifier
Journal, Pub. Year, Title and Pub. Type   63.636         58.5-68.5   C4.5
Pub. Type and Pub. Year                   66.578         61.6-71.3   C4.5
Pub. Type and Title                       67.380         62.4-72.1   C4.5
Pub. Type and Journal                     63.904         58.8-68.8   C4.5
Journal, Pub. Year and Title              50.802         45.6-56.0   SVMs
Journal and Pub. Year                     46.257         41.1-51.5   SVMs
Title only                                51.070         45.9-56.2   SVMs
Pub. Year only                            47.594         42.4-52.8   Bayes Net
Journal only                              47.326         42.2-52.5   Bayes Net
Table 2. Accuracies, 95% confidence intervals, and best performing classifiers for various feature sets.
⁹ Calculated using R's binom.test function (http://www.r-project.org/).
6 Conclusion and Future Work
In this paper, we have discussed some experiments towards the challenging task of automatic evidence grading. Our experiments have produced encouraging results, suggesting that automatic grading of evidence is possible and that modelling evidence grading as a classification problem might be an effective approach. Using publication types alone as features, it is possible to predict SORs with close to 70% accuracy. The experiments also show that information such as journal names, publication years and article titles does not significantly influence the SORs. Our manual analysis revealed that a large number of the errors are caused by the absence of information such as study sizes and consistency among studies. Our future work will focus on incorporating this information as features. There has already been some research on polarity assessment of clinical outcomes [12], and on extraction of specific information (such as study sizes) from medical abstracts [3]. We will attempt to build on these works to generate more features for our classifiers.
It would also be interesting to assess the agreement among human graders of clinical evidence. The evidence based summaries contained in JFP are prepared by domain experts, and there is a possibility that there are inconsistencies among the human-generated grades. Such an assessment will require a significant time commitment from domain experts.
Acknowledgments
This research is jointly funded by Macquarie University and CSIRO. The authors
would like to thank the anonymous reviewers for their helpful comments.
References
1. Aphinyanaphongs, Y., Tsamardinos, I., Statnikov, A., Hardin, D., Aliferis, C.F.:
Text categorization models for high-quality article retrieval in internal medicine.
JAMIA 12(2), 207–216 (2005)
2. Cooper, G.F., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9, 309–347 (October 1992)
3. Demner-Fushman, D., Lin, J.J.: Answering clinical questions with knowledge-based
and statistical techniques. Computational Linguistics 33(1), 63–103 (2007)
4. Ebell, M.H., Siwek, J., Weiss, B.D., Woolf, S.H., Susman, J., Ewigman, B., Bowman, M.: Strength of recommendation taxonomy (SORT): a patient-centered approach to grading evidence in the medical literature. Am Fam Physician 69(3), 548–556 (Feb 2004)
5. Ely, J., Osheroff, J.A., Chambliss, M.L., Ebell, M.H., Rosenbaum, M.E.: Answering
physicians’ clinical questions: Obstacles and potential solutions. JAMIA 12(2),
217–224 (2005)
6. Ely, J.W., Osheroff, J.A., Ebell, M.H., Bergus, G.R., Levy, B.T., Chambliss, M.L.,
Evans, E.R.: Analysis of questions asked by family doctors regarding patient care.
BMJ 319(7206), 358–361 (Aug 1999)
7. Goetz, T., von der Lieth, C.W.: PubFinder: a tool for improving retrieval rate of
relevant PubMed abstracts. Nucleic Acids Research 33, W774–W778 (2005)
8. Greenhalgh, T.: How to read a paper: The Basics of Evidence-based Medicine.
Blackwell Publishing, 3 edn. (2006)
9. Kilicoglu, H., Demner-Fushman, D., Rindflesch, T.C., Wilczynski, N.L., Haynes, B.R.: Towards automatic recognition of scientifically rigorous clinical research evidence. JAMIA 16(1), 25–31 (January 2009)
10. Le Cessie, S., Van Houwelingen, J.C.: Ridge Estimators in Logistic Regression.
Applied Statistics 41(1), 191–201 (1992)
11. Mollá, D.: A Corpus for Evidence Based Medicine Summarisation. In: Proceedings
of the Australasian Language Technology Association Workshop. vol. 8 (2010)
12. Niu, Y., Zhu, X., Li, J., Hirst, G.: Analysis of polarity information in medical text.
In: Proceedings of the AMIA Annual Symposium. pp. 570–574 (2005)
13. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods: Support Vector Learning. pp. 185–208. MIT Press, Cambridge, MA (1998)
14. Plikus, M., Zhang, Z., Chuong, C.M.: PubFocus: semantic MEDLINE/PubMed
citations analytics through integration of controlled biomedical dictionaries and
ranking algorithm. BMC Bioinformatics 7(1), 424–439 (2006)
15. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc. (1993)
16. Sackett, D.L., Rosenberg, W.M.C., Gray, J.A.M., Haynes, R.B., Richardson, W.S.:
Evidence based medicine: what it is and what it isn’t. BMJ 312(7023), 71–72 (1996)
17. Sarker, A., Mollá-Aliod, D.: A Rule-based Approach for Automatic Identification
of Publication Types of Medical Papers. In: Proceedings of the ADCS Annual
Symposium. Melbourne, Australia (December 2010)
18. Tang, T., Hawking, D., Sankaranarayana, R., Griffiths, K., Craswell, N.: Quality-Oriented Search for Depression Portals. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) Advances in Information Retrieval, Lecture Notes in Computer Science, vol. 5478, chap. 60, pp. 637–644. Springer Berlin / Heidelberg, Berlin, Heidelberg (2009)