=Paper=
{{Paper
|id=None
|storemode=property
|title=Towards Automatic Grading of Evidence
|pdfUrl=https://ceur-ws.org/Vol-744/paper7.pdf
|volume=Vol-744
}}
==Towards Automatic Grading of Evidence==
Abeed Sarker¹, Diego Mollá-Aliod¹, and Cécile Paris²
¹ Centre for Language Technology, Department of Computing, Macquarie University, Sydney, NSW 2109, Australia
{abeed.sarker,diego.molla-aliod}@mq.edu.au, URL: http://www.clt.mq.edu.au
² CSIRO – ICT Centre, Locked Bag 17, North Ryde, Sydney, NSW 1670, Australia
cecile.paris@csiro.au, URL: http://www.csiro.au
Abstract. The practice of Evidence Based Medicine requires practitioners to extract evidence from published medical literature and grade the extracted evidence in terms of quality. With the goal of automating the time-consuming grading process, we assess the effects of a number of factors on the grading of the evidence. The factors include the publication types of individual articles, publication years, journal information and article titles. We model the evidence grading problem as a supervised classification problem and show, using several machine learning algorithms, that the use of publication types alone as features gives an accuracy close to 70%. We also show that the other factors do not have any notable effects on the evidence grades.
1 Introduction
An important step for physicians who practise Evidence Based Medicine (EBM) is the grading of the quality of the clinical evidence present in the medical literature. Evidence grading is a manual process, and the time required to perform it adds to the already time-consuming nature of EBM practice [6, 5]. The aim of our work is to identify the extent to which evidence grades can be automatically determined from specific information about each publication, such as the publication type, year of publication, journal name and title. In the following sections, we present a brief overview of EBM and evidence grading, followed by a discussion of our approach, results and planned future work towards building an automatic evidence grading system.
2 Evidence Based Medicine and Evidence Grading
EBM is the ‘conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients’ [16]. Current clinical guidelines urge physicians to practise EBM when providing care for their
patients. Good practice of EBM requires practitioners to search for the best
quality evidence, synthesise collected information and grade the quality of the
evidence.
2.1 The Strength of Recommendation Taxonomy
There are over 100 grading scales in use today for specifying grades of evidence. The Strength of Recommendation Taxonomy (SORT) [4] is one such grading scale. It is a simple, straightforward and comprehensive grading system that can be applied throughout the medical literature. Consequently, it is used by various family medicine and primary care journals, such as the Journal of Family Practice (JFP)³. SORT uses three ratings — A (strong), B (moderate) and C (weak) — to specify the Strength of Recommendation (SOR) of a body of evidence. Due to the popularity of this grading system, we use it as our target grading scheme.
2.2 Factors Influencing Evidence Grades
A number of factors influence the final grade assigned to evidence obtained from one or more published studies. According to Ebell et al. [4], these factors include: the quality of evidence of the individual studies, the types of evidence presented in the studies (i.e., patient-oriented vs disease-oriented⁴) and the consistency of the outcomes presented. In SORT, grade A reflects a recommendation based on consistent and good-quality, patient-oriented evidence; grade B reflects a recommendation based on inconsistent or limited-quality patient-oriented evidence; and grade C reflects a recommendation based on consensus, usual practice, opinion or disease-oriented evidence.
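To make the taxonomy concrete, a minimal sketch follows; it simply encodes the three grade descriptions above as a rule over simplified boolean inputs (a hypothetical helper, not part of SORT or of our system).
<pre>
def sort_grade(patient_oriented: bool, good_quality: bool, consistent: bool) -> str:
    """Toy encoding of the SORT grade descriptions summarised above."""
    if not patient_oriented:
        # Consensus, usual practice, opinion or disease-oriented evidence.
        return "C"
    if good_quality and consistent:
        # Consistent, good-quality, patient-oriented evidence.
        return "A"
    # Inconsistent or limited-quality patient-oriented evidence.
    return "B"
</pre>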
3 Related Work
To the best of our knowledge, there is no existing work addressing automatic evidence grading directly, although there is work on related topics. Related research has focused mostly on automatic quality assessment of medical publications for purposes such as retrieval and post-retrieval re-ranking, where approaches based on word co-occurrences [7] and bibliometrics [14] have been proposed for improving the retrieval of medical documents. Tang et al. [18] propose a post-retrieval re-ranking approach that attempts to re-rank results returned by a search engine, which may or may not be published research work. However, their approach is only tested in a specific sub-domain (i.e., Depression) of the medical domain. Kilicoglu et al. [9] focus on identifying high-quality medical articles and build on
³ http://www.jfponline.com
⁴ Patient-oriented evidence measures outcomes that matter to patients: morbidity, mortality, symptom improvement, cost reduction and quality of life; disease-oriented evidence measures intermediate, physiologic, or surrogate end points that may or may not reflect improvements in patient outcomes, e.g., blood pressure.
the work by Aphinyanaphongs et al. [1]. They use machine learning and obtain 73.7% precision and 61.5% recall. These approaches rely heavily on meta-data associated with the articles, making them dependent on the database from which the articles are retrieved. Hence, these approaches would not work on publications that do not have associated meta-data.
The definitions of ‘good-quality evidence’ [4] suggest that the publication types of medical articles are good indicators of their quality. Literature in the medical domain consists of a large number of publication types of varying quality⁵. For example, a randomised controlled trial is of much higher quality than a case study of a single patient; evidence obtained from the former is thus more reliable. Greenhalgh [8] mentions some other factors that influence the grade of evidence, such as the number of subjects included in a study and the mechanism by which subjects are allocated (e.g., randomisation/no randomisation), but the latter is generally specified by the publication type (e.g., randomised controlled trial) of the article. Recently, Sarker and Mollá [17] emphasised the importance of publication types for SOR determination and showed that automatic identification of high-quality publication types (e.g., Systematic Review and Randomised Controlled Trial) is relatively simple. Lin and Demner-Fushman [3] also acknowledged the importance of publication types in determining the quality of clinical evidence. They use a working definition of the ‘strength of evidence’ as a sum of the scores given to the journal types, publication types and publication years of individual publications. Their scores are used for citation ranking, not evidence grading, and therefore their results cannot be compared to ours. However, their research does suggest that journal names and publication years have an influence on the quality of individual publications, which in turn may influence the grade of evidence obtained from them.
4 Methods
We used the corpus⁶ proposed by Mollá [11] to collect our data. Each record in the corpus is a clinical query obtained from the ‘Clinical Inquiries’ section of JFP. Each query is accompanied by one or more evidence based answers, and each answer is generated from one or more medical publications. Furthermore, each answer contains its SOR, a list of publication references and a brief description of the publications, including their publication types. From the corpus, we collected all evidence based answers that had their SORs specified. Our final set consists of 1132 evidence based answers generated from 2713 medical documents. Of the 1132 answers, 330 are of grade A, 511 of grade B and 291 of grade C. Since it was not possible to accommodate every publication type separately, we grouped together publication types with low frequency and similar quality levels. Our final set consisted of 11 groups of known publication types, each having a different quality level, and 1 group of unknown types, as shown in Figure 1. Based on our collected data, we considered 45.1% — the accuracy when all instances are classified as B, the majority class (511 of the 1132 answers) — as the baseline for our experiments.
⁵ A list of publication types used by the US National Library of Medicine can be found at http://www.nlm.nih.gov/mesh/pubtypes2006.html. This list is not exhaustive.
⁶ The corpus is available to the research community. The authors of this paper can be contacted for details.
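As a toy illustration of the grouping and the majority-class baseline described above, the following Python sketch may help; the raw-to-group mapping is hypothetical and only hints at the kind of merging performed, while the SOR counts are those reported in the text.
<pre>
from collections import Counter

# Hypothetical mapping from raw publication types to groups; the real corpus
# contains many more types, and unseen types fall into the 'Unknown' bucket.
RAW_TO_GROUP = {
    "Randomised Controlled Trial": "RCT",
    "Observational Study": "Other Study",          # low-frequency types are merged
    "Phase II Clinical Trial": "Other Clinical Trial",
}

def group_publication_type(raw_type: str) -> str:
    return RAW_TO_GROUP.get(raw_type, "Unknown")

# Majority-class baseline: always predict the most frequent SOR grade (B).
sor_counts = Counter({"A": 330, "B": 511, "C": 291})    # counts reported above
baseline = max(sor_counts.values()) / sum(sor_counts.values())
print(f"Majority-class baseline: {baseline:.1%}")       # -> 45.1%
</pre>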
4.1 Distribution of Publication Types over SORs
In an initial analysis, we studied the distribution of publication types over the SOR grades (Figure 1). In the figure, ‘Other Study’ refers to low-frequency study types (e.g., Observational Study), ‘Other Clinical Trial’ refers to clinical trials other than ‘Randomised Controlled Trials’ (RCTs) and ‘Unknown’ refers to articles whose publication types are not known. A clear pattern in the distribution of publication types over SORs can be seen. For SOR A, evidence primarily comes from RCTs, Systematic Reviews and Meta-analyses, and the numbers drop significantly for other publication types. For SOR C, most of the evidence comes from publications presenting expert opinion, case series/reports and consensus guidelines. The distribution for SOR B has the largest spread, with Cohort studies having the highest frequency. These distributions suggest that publication types play an important role in determining the SOR.
Fig. 1. Distribution of publication types across SORs.
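The data behind Figure 1 can be reproduced with a simple cross-tabulation; the sketch below assumes a hypothetical long-format table with one row per source publication of an answer, not the actual corpus loading code.
<pre>
import pandas as pd

# Hypothetical records: one row per source publication of an evidence based answer.
records = pd.DataFrame({
    "sor":      ["A", "A", "B", "C", "B"],
    "pub_type": ["RCT", "Systematic Review", "Cohort Study",
                 "Expert Opinion", "RCT"],
})

# Counts of each publication type per SOR grade (the distribution plotted in Figure 1).
distribution = pd.crosstab(records["pub_type"], records["sor"])
print(distribution)
</pre>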
4.2 SOR Prediction from Publication Types
To test the extent to which SORs can be predicted from the publication types, we performed basic experimentation using machine learning. We modelled the grading of evidence as a classification problem, using only the publication types of the articles as features. Each instance in our model represents an evidence based answer and is composed of the SOR class and a vector containing the
counts of each of the 12 publication types shown in Figure 1. Based on the publication types associated with each answer, the classifiers attempt to predict the SOR (A, B or C).
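The instance representation can be sketched as follows; Figure 1 names most of the 12 groups, but the exact list below is partly illustrative (a couple of the group names are placeholders).
<pre>
from collections import Counter
from typing import List, Tuple

# Illustrative list of the 12 publication-type groups; the order fixes the
# positions in the feature vector. A couple of names are hypothetical placeholders.
PUB_TYPE_GROUPS = [
    "Systematic Review", "Meta-analysis", "RCT", "Other Clinical Trial",
    "Cohort Study", "Case Series/Report", "Expert Opinion",
    "Consensus Guideline", "Practice Guideline", "Narrative Review",
    "Other Study", "Unknown",
]

def make_instance(pub_types: List[str], sor: str) -> Tuple[List[int], str]:
    """One evidence based answer -> (counts over the 12 groups, SOR label)."""
    counts = Counter(pub_types)
    return [counts.get(g, 0) for g in PUB_TYPE_GROUPS], sor

features, label = make_instance(["RCT", "RCT", "Systematic Review"], "A")
</pre>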
We used two-thirds of our data for training and the remaining third as held-out test data. For both sets, we kept the proportions of instances belonging to the three classes the same as their proportions in the whole data set. We performed our experimentation using the software package Weka⁷. Weka provides implementations of a range of classifiers organised into generic groups, and in our preliminary analysis we experimented on our training data with multiple classifiers belonging to each generic group. We chose five classifiers that produced good results on our training data and have also been shown to produce good results on similar problems in the past. The five chosen classifiers were (Weka names shown in brackets): Bayes Net, SVMs (SMO), K-Nearest Neighbour (IBk), Multinomial Logistic Regression (Logistic) [10]⁸ and C4.5 Decision Tree (J48) [15]. For specific classifiers, we performed simple parameter tuning and chose the parameter values that produced the best results under stratified 10-fold cross-validation on the training set. For the Bayes Net classifier, we used the K2 search algorithm [2] for local score metrics and the simple estimator for estimating conditional probability tables. For SVMs, we used John Platt’s [13] sequential minimal optimisation algorithm and solved our multi-class problem using pairwise (1-vs-1) classification. We used an RBF kernel for the SVMs, normalised all attributes and used a grid search to find good values for the parameters γ and C. To find the best value of K for the K-Nearest Neighbour algorithm, we searched through all odd values of K from 1 to 101. For the C4.5 Decision Tree classifier, we searched between 2^-5 and 2^-1 to find the best value for the confidence factor parameter.
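The experiments themselves were run in Weka, but the overall procedure can be sketched equivalently in Python with scikit-learn; the snippet below uses placeholder data and assumed search ranges for γ, C and K, and is meant only to illustrate the stratified split, the 10-fold cross-validated grid search and the held-out evaluation described above.
<pre>
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Placeholder data: count vectors over the 12 publication-type groups (X)
# and SOR labels (y). The real features come from the corpus described above.
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(300, 12))
y = rng.choice(["A", "B", "C"], size=300, p=[0.29, 0.45, 0.26])

# Two-thirds training, one-third held out, stratified by SOR grade.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# RBF-kernel SVM with normalised attributes; grid search over gamma and C
# (the ranges here are assumptions, not the grid used in the paper).
svm = Pipeline([("scale", MinMaxScaler()), ("svc", SVC(kernel="rbf"))])
svm_grid = GridSearchCV(svm, {"svc__gamma": 2.0 ** np.arange(-5, 4),
                              "svc__C": 2.0 ** np.arange(-2, 9)}, cv=cv)

# K-Nearest Neighbour: search the odd values of K from 1 to 101.
knn_grid = GridSearchCV(KNeighborsClassifier(),
                        {"n_neighbors": list(range(1, 102, 2))}, cv=cv)

for name, model in [("SVM", svm_grid), ("kNN", knn_grid)]:
    model.fit(X_tr, y_tr)
    print(name, model.best_params_,
          f"held-out accuracy = {model.score(X_te, y_te):.3f}")
</pre>
scikit-learn's SVC resolves multi-class problems with pairwise (one-vs-one) classification internally, which matches the scheme used in our Weka experiments.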
Classifier             Accuracy (%)   95% CI      Parameters
Bayes Net              66.578         61.6-71.3   K2, SimpleEstimator
SVMs                   68.449         63.5-73.1   γ = 1.0, C = 2^7
K-Nearest Neighbour    68.717         63.8-73.4   K = 7
Logistic Regression    67.380         62.4-72.1   -
C4.5                   68.182         63.2-72.9   confidenceFactor = 2^-1
Table 1. Accuracies, 95% confidence intervals and specific parameter values for various classifiers, using only publication types as features.
⁷ http://www.cs.waikato.ac.nz/ml/weka/
⁸ The Weka implementation of this algorithm is slightly different from the original implementation. Details can be found at: http://www.java2s.com/Open-Source/Java-Document/Science/weka/weka/classifiers/functions/Logistic.java.htm
4.3 SOR Prediction from other Factors
In addition to publication types, we attempted to assess the influence of other factors such as journal information and publication year, following Lin and Demner-Fushman’s [3] work. We added the two feature sets — journal name and publication year — to our data, and performed further experimentation by adding the title of each article as a feature set. We suspected that titles may help to identify the qualities of individual publications, since they sometimes provide useful information about how the studies were carried out (e.g., ‘A Double-blind, Placebo-controlled Trial’). In our model, we represented the titles and journal names using uni- and bigrams. Prior to generating the n-grams, we processed the titles by removing stop words, stemming the remaining words using the Porter stemmer and removing words occurring fewer than five times across the whole data set. We repeated the experimental procedures mentioned above with various combinations of these feature sets.
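A rough equivalent of this title preprocessing and n-gram extraction, sketched with NLTK and scikit-learn rather than the tools actually used, looks as follows; the example titles are hypothetical and the frequency cut-off is lowered for the toy data.
<pre>
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("stopwords", quiet=True)
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(title: str) -> str:
    """Lower-case, drop stop words and stem the remaining tokens."""
    tokens = [stemmer.stem(t) for t in re.findall(r"[a-z]+", title.lower())
              if t not in stop_words]
    return " ".join(tokens)

titles = [                                              # hypothetical titles
    "A Double-blind, Placebo-controlled Trial of Drug X",
    "Systematic review of interventions for condition Y",
]

# Uni- and bigrams over the preprocessed titles; the paper removes words
# occurring fewer than five times (min_df=5), lowered here for the toy data.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1)
X_title = vectorizer.fit_transform(preprocess(t) for t in titles)
print(vectorizer.get_feature_names_out())
</pre>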
5 Results and Discussion
Using only publication types as a feature set, we obtained classification accuracies of approximately 66-69% (an improvement of more than 20 percentage points over the baseline) with the various classifiers on our held-out test set. Table 1 shows the accuracies obtained by the five above-mentioned classifiers, along with 95% confidence intervals⁹ for the accuracies and important parameter values for specific classifiers.
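The intervals were obtained with R's binom.test (see footnote 9); an equivalent exact (Clopper-Pearson) interval can be sketched in Python as below, where the held-out test size of 374 answers is an assumption inferred from the reported accuracies rather than a figure stated in the paper.
<pre>
from statsmodels.stats.proportion import proportion_confint

# Exact (Clopper-Pearson) 95% CI for a reported accuracy, analogous to R's binom.test.
n_test = 374                           # assumed held-out size (~1/3 of the 1132 answers)
n_correct = round(0.68717 * n_test)    # e.g. the K-Nearest Neighbour row of Table 1
low, high = proportion_confint(n_correct, n_test, alpha=0.05, method="beta")
print(f"{n_correct / n_test:.1%} accuracy, 95% CI [{low:.1%}, {high:.1%}]")
</pre>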
An analysis of the incorrect classifications revealed that there were few errors between A and C, which is exactly what we expected given their very different distributions of publication types. The most common errors were between SOR A and B, and SOR C classified as B. Our manual analysis revealed that errors were caused primarily by factors such as the sizes of studies, and the consistency and types of outcomes, which our classifiers did not take into account. For example, an essential condition for evidence to be of grade A or B is the presence of patient-oriented outcomes, irrespective of the type of study. At the same time, for certain types of publications, such as Cohort studies, the sizes of the studies significantly influence their quality. Unaware of this information, our classifiers classified all evidence obtained primarily from Cohort studies as grade B. Furthermore, evidence obtained primarily from Meta-analyses and Systematic Reviews was graded as A, irrespective of the consistency or types of outcomes presented in the studies.
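This error pattern is easy to inspect with a confusion matrix over the three grades; the sketch below uses made-up gold and predicted labels purely for illustration.
<pre>
from sklearn.metrics import confusion_matrix

# Hypothetical gold and predicted SOR grades for a handful of test answers;
# rows and columns are ordered A, B, C.
y_true = ["A", "A", "B", "B", "C", "C", "B", "A"]
y_pred = ["A", "B", "B", "A", "B", "C", "B", "A"]
print(confusion_matrix(y_true, y_pred, labels=["A", "B", "C"]))
</pre>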
Our experiments suggest that adding factors such as journal names, publication years and article titles to the publication types does not significantly influence the SORs. Table 2 shows the highest accuracies obtained using various combinations of feature sets, the 95% confidence intervals and the classifiers producing these results. From the table it is evident that the absence of publication types as a feature set causes significant drops in accuracy. Although incorporating article titles as a feature set produces marginally better accuracies than our baseline, our experiments show that no significant improvements are achieved when this feature set is combined with publication types. The other feature sets, alone or in combination with each other, do not give a statistically significant improvement over the baseline.
Features                                  Accuracy (%)   95% CI      Classifier
Journal, Pub. Year, Title and Pub. Type   63.636         58.5-68.5   C4.5
Pub. Type and Pub. Year                   66.578         61.6-71.3   C4.5
Pub. Type and Title                       67.380         62.4-72.1   C4.5
Pub. Type and Journal                     63.904         58.8-68.8   C4.5
Journal, Pub. Year and Title              50.802         45.6-56.0   SVMs
Journal and Pub. Year                     46.257         41.1-51.5   SVMs
Title only                                51.070         45.9-56.2   SVMs
Pub. Year only                            47.594         42.4-52.8   Bayes Net
Journal only                              47.326         42.2-52.5   Bayes Net
Table 2. Accuracies, 95% confidence intervals, and best performing classifiers for various feature sets.
⁹ Calculated using R's binom.test function (http://www.r-project.org/).
6 Conclusion and Future Work
In this paper, we have discussed some experiments towards the challenging task of automatic evidence grading. Our experiments have produced encouraging results, suggesting that automatic grading of evidence is possible and that modelling evidence grading as a classification problem might be an effective approach. Using publication types alone as features, it is possible to predict SORs with close to 70% accuracy. The experiments also show that information such as journal names, publication years and article titles does not significantly influence the SORs. Our manual analysis revealed that a large number of the errors are caused by the absence of information such as study sizes and consistency among studies. Our future work will focus on incorporating this information as features. There has already been some research on polarity assessment of clinical outcomes [12], and on extraction of specific information (such as study sizes) from medical abstracts [3]. We will attempt to build on these works to generate more features for our classifiers.
It would also be interesting to assess the agreement among human graders of clinical evidence. The evidence based summaries contained in JFP are prepared by domain experts, and there is a possibility that there are inconsistencies among the human-generated grades. Such an assessment will require a significant time commitment from domain experts.
Acknowledgments
This research is jointly funded by Macquarie University and CSIRO. The authors
would like to thank the anonymous reviewers for their helpful comments.
References
1. Aphinyanaphongs, Y., Tsamardinos, I., Statnikov, A., Hardin, D., Aliferis, C.F.:
Text categorization models for high-quality article retrieval in internal medicine.
JAMIA 12(2), 207–216 (2005)
2. Cooper, G.F., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9, 309–347 (October 1992)
3. Demner-Fushman, D., Lin, J.J.: Answering clinical questions with knowledge-based
and statistical techniques. Computational Linguistics 33(1), 63–103 (2007)
4. Ebell, M.H., Siwek, J., Weiss, B.D., Woolf, S.H., Susman, J., Ewigman, B., Bowman, M.: Strength of recommendation taxonomy (SORT): a patient-centered approach to grading evidence in the medical literature. Am Fam Physician 69(3), 548–556 (Feb 2004)
5. Ely, J., Osheroff, J.A., Chambliss, M.L., Ebell, M.H., Rosenbaum, M.E.: Answering
physicians’ clinical questions: Obstacles and potential solutions. JAMIA 12(2),
217–224 (2005)
6. Ely, J.W., Osheroff, J.A., Ebell, M.H., Bergus, G.R., Levy, B.T., Chambliss, M.L.,
Evans, E.R.: Analysis of questions asked by family doctors regarding patient care.
BMJ 319(7206), 358–361 (Aug 1999)
7. Goetz, T., von der Lieth, C.W.: PubFinder: a tool for improving retrieval rate of
relevant PubMed abstracts. Nucleic Acids Research 33, W774–W778 (2005)
8. Greenhalgh, T.: How to read a paper: The Basics of Evidence-based Medicine.
Blackwell Publishing, 3 edn. (2006)
9. Kilicoglu, H., Demner-Fushman, D., Rindflesch, T.C., Wilczynski, N.L., Haynes, B.R.: Towards automatic recognition of scientifically rigorous clinical research evidence. JAMIA 16(1), 25–31 (January 2009)
10. Le Cessie, S., Van Houwelingen, J.C.: Ridge Estimators in Logistic Regression.
Applied Statistics 41(1), 191–201 (1992)
11. Mollá, D.: A Corpus for Evidence Based Medicine Summarisation. In: Proceedings
of the Australasian Language Technology Association Workshop. vol. 8 (2010)
12. Niu, Y., Zhu, X., Li, J., Hirst, G.: Analysis of polarity information in medical text.
In: Proceedings of the AMIA Annual Symposium. pp. 570–574 (2005)
13. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods: Support Vector Learning. pp. 185–208. MIT Press, Cambridge, MA (1998)
14. Plikus, M., Zhang, Z., Chuong, C.M.: PubFocus: semantic MEDLINE/PubMed
citations analytics through integration of controlled biomedical dictionaries and
ranking algorithm. BMC Bioinformatics 7(1), 424–439 (2006)
15. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc. (1993)
16. Sackett, D.L., Rosenberg, W.M.C., Gray, J.A.M., Haynes, R.B., Richardson, W.S.:
Evidence based medicine: what it is and what it isn’t. BMJ 312(7023), 71–72 (1996)
17. Sarker, A., Mollá-Aliod, D.: A Rule-based Approach for Automatic Identification
of Publication Types of Medical Papers. In: Proceedings of the ADCS Annual
Symposium. Melbourne, Australia (December 2010)
18. Tang, T., Hawking, D., Sankaranarayana, R., Griffiths, K., Craswell, N.: Quality-Oriented Search for Depression Portals. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) Advances in Information Retrieval, Lecture Notes in Computer Science, vol. 5478, chap. 60, pp. 637–644. Springer Berlin / Heidelberg, Berlin, Heidelberg (2009)