<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLEF 2017 Technologically Assisted Reviews in Empirical Medicine Overview</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Evangelos Kanoulas</string-name>
          <email>E.Kanoulas@uva.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dan Li</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leif Azzopardi</string-name>
          <email>leif.azzopardi@strath.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rene Spijker</string-name>
          <email>R.Spijker-2@umcutrecht.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cochrane Netherlands and UMC Utrecht, Julius Center for Health Sciences and Primary Care</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer and Information Sciences, University of Strathclyde</institution>
          ,
          <addr-line>Glasgow</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Informatics Institute, University of Amsterdam</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Systematic reviews are a widely used method to provide an overview of the current scientific consensus, by bringing together multiple studies in a reliable, transparent way. The large and growing number of published studies, and their increasing rate of publication, makes the task of identifying all relevant studies in an unbiased way both complex and time consuming, to the extent that it jeopardizes the validity of the reviews' findings and the ability to inform policy and practice in a timely manner. The CLEF 2017 e-Health Lab Task 2 focuses on the efficient and effective ranking of studies during the abstract and title screening phase of conducting Diagnostic Test Accuracy systematic reviews. We constructed a benchmark collection of fifty such reviews and the corresponding relevant and irrelevant articles found by the original Boolean queries. Fourteen teams participated in the task, submitting 68 automatic and semi-automatic runs, using information retrieval and machine learning algorithms over a variety of text representations, in a batch and iterative manner. This paper reports both the methodology used to construct the benchmark collection and the results of the evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>Evaluation</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Systematic Reviews</kwd>
        <kwd>TAR</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Active Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Evidence-based medicine has become an important pillar in health care and policy making. In order to practice evidence-based medicine, it is important to have a clear overview of the current scientific consensus. These overviews are provided in systematic review articles that summarize all available evidence regarding a certain topic (e.g., a treatment or diagnostic test). In order to write a systematic review, researchers have to conduct a search that will retrieve all the studies that are relevant. The large and growing number of published studies, and their increasing rate of publication, makes the task of identifying relevant studies in an unbiased way both complex and time consuming, to the extent that it jeopardizes the validity of the reviews' findings and the ability to inform policy and practice in a timely manner. Hence, the need for automation in this process becomes of utmost importance. Finding all relevant studies in a corpus is a difficult task, known in the Information Retrieval (IR) domain as the total recall problem.</p>
      <p>To date, retrieval of evidence to inform systematic reviews is conducted in multiple stages:
1. Boolean Search: At the first stage, information specialists build a broad Boolean query expressing what constitutes relevant information. The query is then submitted to a medical database containing titles, abstracts, and indexing terms of a controlled vocabulary of medical studies. The result is a set, A, of potentially interesting studies.
2. Title and Abstract Screening: At a second stage, experts screen the titles and abstracts of the returned set and decide which of those hold potential value for their systematic review, a set D. If screening an abstract has a cost Ca, screening all |A| abstracts has a cost of Ca · |A|.
3. Study Screening: At a third stage, experts download the full text of the potentially relevant abstracts, D, identified in the previous phase and examine the content to decide whether these studies are indeed relevant or not. Examining a document typically has a larger cost Cd &gt; Ca. The result of the second screening is a set of references to be included in the systematic review.</p>
      <p>Unfortunately, the precision of the Boolean searches is typically low, hence
reviewers often need to look manually through many thousands of irrelevant titles
and abstracts in order to identify a small number of relevant ones. Furthermore,
the recall of the searches is often assumed to be 100%, which may not be the
case.</p>
      <p>
        To overcome some of the limitations of the Boolean search, researchers have
been testing the effectiveness of machine learning and information retrieval
methods. O’Mara-Eves et al.[
        <xref ref-type="bibr" rid="ref3">15</xref>
        ] provide a systematic review of the use of text mining
techniques for study identification in systematic reviews.
      </p>
      <p>The goal of this lab is to bring together academic, commercial, and
government researchers that will conduct experiments and share results on automatic
methods to retrieve relevant studies with high precision and high recall, and
release a reusable test collection that can be used as a reference for
comparing different retrieval and mining approaches in the field of medical systematic
reviews.
</p>
    </sec>
    <sec id="sec-2">
      <title>Benchmark Collection</title>
      <p>To construct the benchmark collection, the organizers of the task considered 58 systematic reviews on Diagnostic Test Accuracy conducted by Cochrane researchers. These reviews are publicly available through the Cochrane Library4
4 http://www.cochranelibrary.com/
and can be identified by setting the topic filter in the library to "Diagnostic" and "Diagnostic Test Accuracy" and the stage filter to "Review". At the date of publication of this article, 79 such studies are available; however, the last 22 were performed after the organizers put the collection together. The 58 systematic reviews considered can be found in the Appendix of this article in Table 6.</p>
      <p>Participants were provided with two data sets: (a) a development set, and
(b) a test set. The development set consists of 20 topics for Diagnostic Test
Accuracy (DTA) systematic reviews, while the test set consists of 30 topics. For
both sets, one topic file and two files of relevance judgments at abstract and
document level respectively are constructed (qrel’s).</p>
      <p>The topic file is generated through the following procedure. For each
systematic review, we reviewed the search strategy from the corresponding study
in Cochrane Library. A search strategy, among others, consists of the exact
Boolean query developed and submitted to a medical database, at the time the
review was conducted, and typically can be found in the Appendix of the study.
Rene Spijker, a co-author of this work and a Cochrane information specialist, examined the grammatical correctness of the search query and specified the date range which dictated the valid dates for the articles to be included in this systematic review. The date range was necessary because a study published after the systematic review should not be included even though it might be relevant, since that would require manually examining its content to quantify its relevance. Important note: A number of medical databases, and search interfaces to these databases, are available for search, and for each one information specialists construct a different variation of their query that better fits the data and metadata of the database.
for the MEDLINE database, using the Wolters Kluwer Ovid interface. Then we
submitted the constructed Boolean query to the OVID system5 and collected all
the returned PubMed document identification numbers (PMID’s) which satisfied
the date range constraint. This step was automated by a Python script we put
together and through an interface available to the University of Amsterdam6.
Out of the 58 reviews 8 were discarded since the provided Boolean query was
not in the right format, which made it difficult if not impossible to reconstruct
the set of PMID’s, hence the 50 topics in the development and test set.</p>
      <p>The topic file is in a text format and contains four sections, Topic, Title, Query, and PMIDs, where Topic is the topic ID, a substring of the DOI of the document (e.g. CD010438 for 10.1002/14651858.CD010438.pub2), and PMIDs are the document IDs returned by the Boolean query. The PMIDs can be used to access the corresponding documents through the National Center for Biotechnology Information (NCBI)7. An example of a topic file can be viewed below.
5 http://demo.ovid.com/demo/ovidsptools/launcher.htm
6 https://github.com/dli1/tar_data_collection
7 https://www.ncbi.nlm.nih.gov/books/NBK25497/
Topic: CD009551
Title: Polymerase chain reaction blood tests for the diagnosis of
invasive aspergillosis in immunocompromised people</p>
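      <p>As an illustration, the following minimal Python sketch parses such a topic file; the function name and the assumed section labels ("Topic:", "Title:", "Query:", "Pids:") are ours and may differ slightly from the distributed files.</p>
      <p># Minimal sketch (our own, not official task tooling): read a topic file with
# Topic / Title / Query / PMIDs sections into a dictionary.
def parse_topic_file(path):
    topic = {"topic": None, "title": "", "query": [], "pmids": []}
    section = None
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line.startswith("Topic:"):
                topic["topic"] = line.split(":", 1)[1].strip()
            elif line.startswith("Title:"):
                topic["title"] = line.split(":", 1)[1].strip()
                section = "title"
            elif line.startswith("Query:"):
                section = "query"
            elif line.startswith("Pids:"):
                section = "pmids"
            elif section == "title" and line.strip():
                topic["title"] += " " + line.strip()  # titles may wrap over several lines
            elif section == "query":
                topic["query"].append(line)
            elif section == "pmids" and line.strip():
                topic["pmids"].append(line.strip())
    return topic</p>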
      <p>For the construction of the qrel files, we considered the reference section of the 50 systematic reviews. The references are split into three categories: Included, Excluded, and Additional. Included are the studies that are relevant to the systematic review. Excluded are the studies that were considered relevant at the abstract and title screening stage, but were considered irrelevant to the study at the article screening phase and hence excluded from it. Additional are references that do not impact the outcome of the study, and hence are irrelevant to it. The included references were the relevant studies in the document-level qrels, while both the included and excluded references were considered relevant in the abstract-level qrels. The format of the qrels followed the standard TREC format:</p>
      <p>Topic Iteration Document Relevance
where Topic is the topic ID of the systematic review, Iteration in our case is a dummy field, always zero and not used, Document is the PMID, and Relevance is a binary code of 0 for not relevant and 1 for relevant studies. The order of documents in the qrel files is not indicative of relevance. Studies that were returned by the Boolean query but were not relevant based on the above process were considered irrelevant. Those are studies that were excluded at the abstract and title screening phase. All other documents in MEDLINE were also assumed to be irrelevant, given that they were not judged by the human assessor.</p>
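      <p>For illustration, a small Python sketch that reads such a qrel file into per-topic sets of relevant PMIDs is given below; the function and file names are ours and purely illustrative.</p>
      <p>from collections import defaultdict

# Minimal sketch: read a TREC-format qrel file ("topic iteration pmid relevance")
# into a mapping from topic ID to the set of relevant PMIDs.
def read_qrels(path):
    relevant = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue  # skip malformed lines
            topic, _iteration, pmid, rel = parts
            if rel == "1":
                relevant[topic].add(pmid)
    return relevant

# e.g. abstract_level = read_qrels("qrels_abstract.txt")  # hypothetical file name</p>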
      <p>Important note: As mentioned earlier, the references of a systematic review were produced after a number of Boolean queries were submitted to a number of medical databases, and their titles and abstracts were screened. The PMIDs provided, however, were only those that came out of the MEDLINE query. Therefore, there were a number of abstract-level relevant studies (the gray area in the Venn diagram below) that were not part of the result set of the Boolean query provided to the participants. For the development set, the qrel file contained those additional PMIDs, for those participants that would decide to search the entire MEDLINE database and not only consider the studies provided to them in the Topic files. To the best of our knowledge, no one submitted such a system; hence, to avoid any bias, we excluded those relevant studies from the test set.</p>
      <sec id="sec-2-1">
        <title>MEDLINE Boolean Query</title>
      </sec>
      <sec id="sec-2-2">
        <title>Relevant Studies</title>
        <p>
          Table 1 shows the distribution of the relevant documents at the abstract and document level for all the topics in the development set and the test set. The total number of unique PMIDs is 149,405 for the development set and 117,562 for the test set. Their percentages of relevant documents at the abstract level are quite close: 1.88% for the development set and 1.58% for the test set. This is not the case at the document level, however, where the number of relevant documents in the test set is almost twice as large as in the development set, even though the rates of relevant studies are 0.52% and 0.33%, respectively. In [
          <xref ref-type="bibr" rid="ref5">17</xref>
          ], a test collection was developed based on a random selection of 93 Cochrane systematic reviews (not just DTAs), which reported a slightly higher rate of relevance (about 1.2%). For comparison, the rate of relevant documents is 5.45% for the Ad hoc track of TREC-8 and 2.78% for the Web track of TREC 2002. Overall, the rate of relevant documents in this lab is not very high, making locating them quite a difficult task.
        </p>
        <p>Important note: As one can observe in Table 1, there are topics for which the output of the Boolean query is rather narrow, with as few as 64 studies to be reviewed for topic CD008760. Cochrane conducts systematic reviews on a regular basis, in an attempt to update each review every two to three years. Some of the reviews considered for the construction of the benchmark collection, such
as the CD008760 review, are updates to previous reviews. These updates only specify a query for a time range that starts after the last review on the topic was conducted. Hence, the 64 studies are the output of the Boolean query for this short time range, which explains the small number. If the Boolean query were run against the entire MEDLINE database, the number of studies would be in the range of tens of thousands, as is the case for some other reviews considered, e.g. CD008782.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Task Description</title>
      <p>
        The CLEF 2017 e-Health Lab [
        <xref ref-type="bibr" rid="ref9">8</xref>
        ], task 2, focused on retrieving studies for
conducting Diagnostic Test Accuracy (DTA) systematic reviews. Retrieval in this
area is generally considered very difficult, where sensitive searches result in large
quantities of references to be screened manually, and a breakthrough in this field
would likely be applicable to other areas as well. The task has a focus on the
second stage of the process, i.e. given the results of a Boolean search how to
make abstract and title screening more effective and efficient. Currently a
typical number needed to read (NNR), the number of studies to screen to identify
1 eligible study, for DTA systematic reviews is approximately 80 when applied
to potential abstracts that need further full text assessment. With an average
of 7000 results to be screened, which would take approximately 120 hours to
screen (1 minute per abstract [
        <xref ref-type="bibr" rid="ref6">18</xref>
        ]), a huge benefit can be gained by reducing the workload of this process.
      </p>
      <p>Given the results of the Boolean search from stage 1 as the starting point,
participants were asked to rank the set of the provided abstracts. The task
had two goals: (i) to produce an efficient ordering of the documents, such that
all the relevant abstracts are retrieved above the irrelevant ones, and (ii) to
identify the relevant subset of abstracts to be shown to a user, that is, a stopping point in the ranked list of abstracts where a researcher could confidently stop screening abstracts and titles. Therefore, we solicited two types of submissions: (i) ranking submission: automatic or manual methods that rank all abstracts, with the goal of retrieving relevant abstracts as early in the ranking as possible, and (ii) thresholding submission: methods that additionally identify such a stopping point; thresholding can be performed in a batch or iterative manner.</p>
      <p>We also considered two evaluation frameworks, (a) a simple evaluation, and
(b) a cost-effective evaluation. The assumption behind the simple evaluation
framework is the following: The user of your system is the researcher that
performs the abstract and title screening of the retrieved articles. Every time an
abstract is returned (i.e. ranked) there is an incurred cost/effort of CA, while
the abstract is either irrelevant (in which case no further action will be taken)
or relevant (and hence passed to the next stage of document screening) to the
topic under review. The assumption behind the cost-effective evaluation is the
following: The user that performs the screening is not the end-user. The user
can interchangeably perform abstract and title screening, or document
screening, and decide what PMIDs to pass to the end-user. Every time an abstract
is returned the user can either (a) read the abstract (with an incurred cost of
CA) and decide whether to pass this PMID to the end-user, or (b) read the
full document (with an incurred cost of CA+CD) and decide whether to pass
this PMID to the end-user, or (c) directly pass the PMID to the end user (with
an incurred cost of 0), or (d) directly discard the PMID and not pass it to the
end user (with an incurred cost of 0). For every PMID passed to the end-user there is a cost attached to it: CA if the abstract passed on is not relevant, and CA + CD if the abstract passed on is relevant (that is, we assume that the end-user completes a two-round abstract and document screening, as usual, but only for the PMIDs the algorithm+feedback user decided to be relevant). Although a small number of teams participated in the cost-effective sub-task, the lab focused on the simple evaluation sub-task, and this is what is described in the remainder of this report.</p>
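      <p>To make the bookkeeping of the cost-effective framework concrete, the sketch below tallies the cost of a sequence of per-PMID decisions; the action labels, the function name, and the example value for CD are ours and only illustrate the accounting described above, not the official evaluation code.</p>
      <p>C_A, C_D = 1.0, 2.0  # illustrative values: CA = 1 matches the setting used later; CD is an assumption

# decisions: iterable of (action, passed_on, relevant) triples, where action is one of
# "read_abstract", "read_document", "pass", "discard".
def cost_effective_cost(decisions):
    feedback_user_cost = 0.0
    end_user_cost = 0.0
    for action, passed_on, relevant in decisions:
        if action == "read_abstract":
            feedback_user_cost += C_A
        elif action == "read_document":
            feedback_user_cost += C_A + C_D
        # "pass" and "discard" cost the feedback user nothing
        if passed_on:
            # the end-user re-screens every PMID passed on: abstract only if it is
            # not relevant, abstract plus full document if it is relevant
            end_user_cost += (C_A + C_D) if relevant else C_A
    return feedback_user_cost + end_user_cost</p>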
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>Evaluation within the context of using technology to assist in the reviewing
process is very much dependent on how the user(s) interact with the system - and
the goal of the technology assistance. For example, the goal of the assistance could be to automate the screening process, where the system assesses all the abstracts and returns a subset of the initial set to be screened by the end-user (i.e. screened in batch mode). Or, it could be used to identify all the relevant documents as soon as possible, in an iterative manner, where the system asks for feedback from the end-user to help improve the ranking. Of course, an open problem then is to decide when to stop requesting feedback and when to stop assessing abstracts. In that case a subset of abstracts is identified, which consists of the abstracts that have been screened during the feedback cycles and the remainder that are screened but not used for feedback (i.e. in batch mode). There are, of course, many
other possible variations. For the purposes of this initial track/task, we consider
the problem as a ranking task - that is to rank the set of documents associated
with the topic in decreasing order of relevance. We consider a document relevant
if the abstract passed the abstract screening phase (regardless of whether it was
included or excluded from the study).</p>
      <p>For this task we employ a number of standard measures, typically used in
IR ranking evaluations, along with other measures from related tracks and some
new measures we have developed.</p>
      <p>– Standard Measures</p>
      <p>Average Precision (AP)
Normalized cumulative gain @ 0% to 100% of documents shown; for the simple case that judgments are binary, normalized cumulative gain @ x% is simply Recall @ x% of shown documents [10]
Number of Relevant Found (nr)
Recall r = nr / R, where R is the total number of relevant documents
Number of documents returned/shown (n)
– Related Measures (from [6,5])</p>
      <p>LOSS-R: loss_r = (1 - r)^2
LOSS-E: loss_e = (n/(R + 100) · 100/N)^2, where N is the size of the collection
Reliability = loss_r + loss_e [6]</p>
      <p>Work Saved over Sampling at r, W SS@Recall = (T N +F N )=N (1 r)[5]
– Proposed Measures</p>
      <p>Last Rel Found: Minimum number of documents returned to retrieve all
nr relevant documents
Total Cost (TC);
Total Cost with Uniform penalty (TCU)</p>
      <p>Total Cost with Weighted Penalty (TCW)</p>
      <p>To calculate the cost-based measures, we considered three possible interactions to support a range of different ways to screen the items and to utilize feedback when ranking. We consider the following possibilities:
1. Suppose we have a ranking algorithm which uses no feedback from the user and simply ranks the list of abstracts. The list is then presented to the end-user, who evaluates it in a batch. In this case, no feedback is requested, and abstracts are marked NF.
2. Suppose we have a ranking algorithm which uses feedback (i.e. abstract(s) are presented to the user, feedback on their relevance is obtained, which is then used by the algorithm, thus simulating online feedback from the user). In this case, each abstract for which feedback from the user is requested is marked AF, while abstracts for which no feedback is requested are marked NF. Abstracts marked NF are then presented to the end-user to evaluate in a final batch.
3. For either option above, the algorithm may decide that an abstract is not relevant and thus does not need to be shown to a user; such abstracts are marked NS.</p>
      <p>To calculate the total cost (TC), we calculated:</p>
      <p>TC = #NF · Ca + #AF · (Ca + Cf)   (1)
where Ca is the cost of assessing the abstract, Cf is the cost of asking for feedback, #NF is the number of NF items, and #AF is the number of AF items.</p>
      <p>We also created two additional cost measures which included a penalty for
missing relevant abstracts (a) with a uniform penalty and (b) a weighted penalty.
The uniform penalty was calculated as follows:</p>
      <p>TCU = TC + ((R - r)/R) · (N - n) · Cp   (2)
where Cp is the cost of the penalty of missing a relevant abstract, and N is the total number of documents in the set for the topic. The assumption behind this penalty is that the end-user would need to continue examining abstracts before they would find the remaining (R - r) relevant items, and encounters them at a uniform rate in the remaining N - n abstracts which were not shown. So if half the relevant items were missing, then the penalty component would be (N - n) · Cp / 2. If no relevant items were missing, the penalty component would be zero.</p>
      <p>The weighted penalty was calculated as follows:</p>
      <p>TCW = TC + Σ_{i=1}^{(R - r)} (1/2^i) · (N - n) · Cp   (3)
where the assumption is that the end user would need to examine half of the remaining documents to find the next relevant abstract, per missing relevant abstract. So if all relevant items were missing, then the summation would tend to one and the penalty component tends to (N - n) · Cp, while if only one relevant item is missing, then the penalty component is (N - n) · Cp / 2.</p>
      <p>To compute these measures we set Ca = 1, Cf = 2, and Cp = 2, to represent the relative costs of the different actions. Note that these are not based on any empirical data and are used as a way to penalize feedback requests and unshown documents.</p>
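      <p>A compact Python sketch of these measures, under the parameter values above, is given below; it reflects our reading of Equations (1)-(3) and of the loss and WSS components, and is not the official evaluation script.</p>
      <p>Ca, Cf, Cp = 1.0, 2.0, 2.0  # the relative costs stated above

def total_costs(num_nf, num_af, R, r_found, N, n_shown):
    # num_nf / num_af: abstracts shown without / with a feedback request;
    # R: total relevant; r_found: relevant found; N: collection size for the topic;
    # n_shown: documents returned/shown.
    tc = num_nf * Ca + num_af * (Ca + Cf)                  # Eq. (1)
    tcu = tc + ((R - r_found) / R) * (N - n_shown) * Cp    # Eq. (2), uniform penalty
    weight = sum(1.0 / 2 ** i for i in range(1, R - r_found + 1))
    tcw = tc + weight * (N - n_shown) * Cp                 # Eq. (3), weighted penalty
    return tc, tcu, tcw

def reliability(recall, n_shown, R, N):
    loss_r = (1.0 - recall) ** 2                           # LOSS-R
    loss_e = (n_shown / (R + 100.0) * 100.0 / N) ** 2      # LOSS-E, as in [6]
    return loss_r + loss_e

def wss_at_recall(tn, fn, N, recall):
    # Work Saved over Sampling at the given recall [5]
    return (tn + fn) / N - (1.0 - recall)</p>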
    </sec>
    <sec id="sec-5">
      <title>Participants</title>
      <p>Fourteen groups from eleven countries submitted a total of 68 runs for this task.</p>
      <p>Table 2 categorizes the participating runs along five dimensions: (a)
automatic vs manual runs; (b) use of the development set; (c) use of supervised
and semi-supervised learning algorithms, (d) use of relevance feedback; and (e)
thresholding the ranked list of articles. The categorization has been performed by the lab coordinators – not by the participants – based on the participants' submitted descriptions of their algorithms. Hence, there is always a chance of mis-classifying some run. Out of the 68 runs submitted, 52 focused on the simple evaluation framework, while 16 on the cost-effective one. Out of the 52
submitted runs for the simple sub-task, 35 ranked all the PMIDs that were returned by
the Boolean query, while 17 tested different stopping criteria over the ranking.
Participants employed both supervised and unsupervised methods, for ranking
articles. A large number of runs were trained over the provided development
set, and their generalization was tested against the test topics. 26 runs used the
development set in some fashion, while 26 made no explicit use of it; it may be
the case that participants tried different models and algorithms over the
development set, and selected to submit the best performing ones, hence there may
be a flavor of model selection, however we did not consider this as use of the
development set. Participants represented the textual data in a variety of ways,
including document-topic features, bag-of-words, topic model distributions,
embeddings, and metadata. In the remainder of this section, by article we mean the abstract
and the title of an article. We are not aware of any participant that worked on
the full text of these articles.</p>
      <p>
        In particular, AMC took a batch supervised approach, training a Random
Forest over a topic model representation of the articles. A 75-topic model was
fitted over all articles in the collection, and the Topic-to-Document matrix was
used to extract features [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        AUTH took a learning-to-rank approach, using both batch and active
learning. Their model, HybridRankSVM, consists of two parts: an inter-topic model
which utilizes XGBoost and is trained over the entire development corpus and
an intra-topic model, an iteratively-built SVM, trained over relevance feedback
provided partially in the test topics. For the inter-topic model a total of 24 topic-document (or solely topic) features were computed over the title, abstract, and
mesh terms of the articles and the query. For the intra-topic model a TF-IDF
vectorization of the articles was used [
        <xref ref-type="bibr" rid="ref10 ref13">3</xref>
        ].
      </p>
      <p>CNRS trained a logistic regression model on n-gram features from the titles
and abstracts and structured data from the Medline citations. One of their
models was trained using stochastic gradient descent on the majority of the features,
and one on the principal components of a subset of the features. Class
imbalance was handled by reweighting and undersampling, while two approaches for
relevance feedback were investigated [13].</p>
      <p>ECNU took a learning-to-rank approach, using BM25, PL2, and BB2 as
features. The trained model was also combined with a vector space model [4].</p>
      <p>ETH used a LambdaMART model trained on features such as BM25, fuzzy search, vector content representations, and publication data. This model was used to
experiment with different stopping criteria. One of the approaches taken was to
use minimal relevance feedback to estimate the distribution of positive samples
by score. This was done by sampling from the articles, preferring articles with
higher score. A Gaussian distribution was fitted on the positive samples and
the resulting biased distribution was corrected. The correction worked by first
adapting the mean and then iteratively finding the standard deviation matching
the sampled data the best. For more details the reader can refer to [9].</p>
      <p>
        NCSU adopted a continuous active learning framework for this task. An
SVM classifier was trained on the relevance feedback labels and undersampling
of the negatively labeled articles removing those furthest from the SVM
decision hyperplane was employed. Different runs made use of different weights on
the labels depending on whether the abstract or the full text was considered
relevant [
        <xref ref-type="bibr" rid="ref8">20</xref>
        ].
      </p>
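      <p>As a rough illustration of a continuous active learning loop of this kind (our own generic sketch, which omits the undersampling and label weighting the team describes, and is not their implementation), consider:</p>
      <p>import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Generic CAL sketch: repeatedly train on the labels collected so far and request
# feedback on the top-scoring unlabelled abstracts.
def cal_loop(texts, oracle, seed_relevant, batch_size=25, rounds=40):
    X = TfidfVectorizer().fit_transform(texts)
    labels = {i: 1 for i in seed_relevant}          # index -> 0/1 relevance label
    for _ in range(rounds):
        idx = sorted(labels)
        y = np.array([labels[i] for i in idx])
        unlabeled = [i for i in range(len(texts)) if i not in labels]
        if not unlabeled:
            break
        if len(set(y)) == 1:
            batch = unlabeled[:batch_size]          # no label diversity yet: take a plain batch
        else:
            clf = LinearSVC().fit(X[idx], y)
            scores = clf.decision_function(X)
            order = np.argsort(-scores)
            batch = [i for i in order if i not in labels][:batch_size]
        for i in batch:
            labels[i] = oracle(i)                   # simulated relevance feedback
    return labels</p>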
      <p>NTU examined the role of convolutional neural networks for classifying
medical articles for systematic reviews [12].</p>
      <p>Padua used a two-dimensional probabilistic version of BM25 to rank articles.
The parameters were tuned using the development set. Further, the top abstract
returned by BM25 was provided to two non-experts who generated one
additional query each. The three queries were then used to re-rank articles. Different
approaches for relevance feedback and thresholding were investigated [14].</p>
      <p>
        QUT trained a learning-to-rank model using domain specific features. As
domain specific features, PICO annotations (Population, Intervention, Control,
Outcome) were used; these were extracted automatically from articles and
manually from the Boolean queries [
        <xref ref-type="bibr" rid="ref4">16</xref>
        ].
      </p>
      <p>
        Sheffield automatically parsed the Boolean queries to extract both the terms
and MeSH headings, and used TF-IDF cosine similarity to calculate the
similarity score between document title and abstracts [
        <xref ref-type="bibr" rid="ref1 ref12">1</xref>
        ].
      </p>
      <p>UOS explored two methods: (i) topic models, where they used Latent Dirichlet Allocation to identify topics within the set of retrieved articles, and then ranked articles by the topic most likely to be relevant to the query, and (ii) relevance
feedback, where they used Rocchio’s algorithm to update the query model for
subsequent rounds of interaction. A third approach combined the topic model
and relevance feedback approaches to quickly identify the relevant articles. For
the thresholding task, they applied a score threshold over BM25 [11].</p>
      <p>
        UCL took a supervised approach and trained a deep model architecture to
identify studies pertaining to a given review topic [
        <xref ref-type="bibr" rid="ref7">19</xref>
        ].
      </p>
      <p>Waterloo applied the Baseline Model Implementation (BMI) from the TREC
Total Recall Track (2015-2016). They further applied their "knee-method"
stopping criterion to BMI to determine how many abstracts should be examined for
each topic [7].
</p>
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>Table 3 presents a number of evaluation measures for those runs that ranked the
entire set of articles provided by the original Boolean queries; no thresholding
has been applied. Some runs, as can be seen in Tables 7, 8, 9, and 10, even though they applied no stopping criterion, still missed a number of documents. There may be multiple reasons for that, e.g. missing some topic, or not being able to download the abstract text, since participants were provided with PMIDs only. The number of documents for which feedback was requested appears in the second column of the table, while the remaining columns report different measures of performance.</p>
      <p>Figure 2 shows the recall-effort curves for the participating runs, that is the
recall value at different percentages of documents shown to the user. The straight pink line with the triangular markers on x=y is the result of randomly shuffling the output of the Boolean query, and it serves as a naive baseline, provided by the UOS team. The brown curve with the triangular markers is the BM25 retrieval function, also provided by the UOS team as a baseline; it ranks abstracts by BM25 over the Boolean query terms, with the default BM25 parameter settings.</p>
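      <p>A baseline of this kind can be sketched with an off-the-shelf BM25 implementation; the snippet below uses the rank_bm25 package purely as an illustration and is not the UOS team's code.</p>
      <p>from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Illustrative BM25 baseline: rank a topic's abstracts by BM25 similarity to the
# terms extracted from the Boolean query, with default parameters.
def bm25_rank(abstracts, query_terms):
    tokenized = [doc.lower().split() for doc in abstracts]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores([t.lower() for t in query_terms])
    return sorted(range(len(abstracts)), key=lambda i: scores[i], reverse=True)</p>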
      <p>Figure 3 presents the box-plots of Mean Average Precision values for runs
that do not make use of relevance feedback (left) and runs that make use of
relevance feedback (right) respectively. On average relevance feedback boosts
the effectiveness of the ranking algorithms, as expected; however, it may come with additional cost in terms of assessing the relevance of abstracts (depending on the screening setup considered).</p>
      <p>Table 4 presents a number of evaluation measures for those runs that applied a threshold criterion. The total number of abstracts shown to the user can be found in the second column of the table, the number of documents for which feedback was requested in the third, while the remaining columns report different measures of performance. The cost measures account both for the cost
of presenting a document to the user and for the additional cost of requesting
feedback for a document, while they also account for the cost one would need to
pay to reach 100% recall, under certain assumptions. Reliability considers the
cost of not finding all relevant documents but makes no discrimination between
the documents returned to the user and those for which feedback is requested.
Average precision is well defined under the stopping criterion but hard to use for comparing runs that use different thresholds. An easy-to-understand
measure is the achieved recall at the rank of the threshold.</p>
      <p>Figure 4 presents recall at the point of the threshold as a function of the number of documents presented to the user; that is, at different stopping criteria, but also with different ranking and thresholding algorithms. As expected, the more documents presented to the user (the lower the threshold criterion), the higher the achieved recall. Nevertheless, there are still algorithms that dominate
others. The figure presents the Pareto frontier. Figure 5 presents recall at the point of the threshold as a function of the number of feedback documents requested. As can be seen, although feedback documents are in principle helpful towards achieving a high recall, there are algorithms that used no relevance feedback and still achieved high recall at a threshold.
Table 5 provides statistics on the topics used in the test set, along with the average Average Precision (AAP) for each topic, a measure that can be seen as a proxy of the difficulty of each topic. The Pearson correlation coefficients between AAP and the percentage of relevant documents, the total number of documents, and the total number of relevant documents are -0.4868 (p-value = 0.006), 0.1295 (p-value = 0.495), and 0.8994 (p-value = 0), respectively. Figures 6 and 7 visually demonstrate
this correlation.</p>
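      <p>Such correlations can be reproduced with standard tooling, for instance (illustrative placeholder values, not the lab's data):</p>
      <p>from scipy.stats import pearsonr

# Per-topic AAP against a per-topic statistic, e.g. the percentage of relevant documents.
aap = [0.21, 0.35, 0.12, 0.44]            # placeholder values
pct_relevant = [1.2, 0.8, 2.5, 0.6]       # placeholder values
r, p_value = pearsonr(aap, pct_relevant)
print(f"Pearson r = {r:.4f}, p-value = {p_value:.3f}")</p>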
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>The CLEF 2017 e-Health Lab Task 2 constructed a benchmark collection of 50
Diagnostic Test Accuracy systematic reviews to study the effectiveness and
efficiency of information retrieval and machine learning algorithms in prioritizing
the studies to be screened at the abstract and title screening stage, and
providing a stopping criterion over the ranked list. The results demonstrate that
automatic methods can be trusted for finding most, if not all, relevant studies
in a fraction of the time that manual screening would require. Given that across different runs many parameters change simultaneously, it is not easy to come to firm conclusions about the relative performance of automatic methods.</p>
      <p>Regarding the benchmark collection itself, there are a number of limitations to be considered: (a) Pivoting on the results of the OVID MEDLINE Boolean query limits our ability to identify all relevant studies, i.e. relevant studies that are returned by Boolean queries over different databases, and relevant studies that are actually not found by these Boolean queries. The former can be overcome by considering all the different queries submitted; for the latter, extra manual judgments would be required. (b) Pivoting on abstract and title only, we miss the opportunity to study the effect of automatic methods when applied to the full text of the studies, which would present an opportunity to completely overcome the multi-stage process of systematic reviews. However, most of the full-text articles are protected under copyright laws that do not give all participants access to them. (c) The evaluation setup of ranking does not allow us to consider the cost of the process, since given a ranking a researcher would still have to go over all ranked studies. A more realistic setup, e.g. a double-screening setup, could be considered. (d) In the construction of the relevance judgments we considered the included and excluded references of the systematic reviews under study, which prevented us from studying the noise and disagreement between reviewers. (e) In our effort to allow iterative algorithms, e.g. active learning algorithms, to be submitted, we handed the test set's relevance judgments directly to the participants, which is rather unusual for this type of evaluation exercise. An alternative would be the setup used by the TREC Total Recall track, where participants submitted their running algorithms to the organizers. (f) When it comes to evaluation measures there is a large variety of them, each of which takes a different, often useful, viewpoint on the effectiveness of an algorithm, but this makes it difficult to decide upon a single golden measure to rank participants' runs.
</p>
      <p>3. Anagnostou, A., Lagopoulos, A., Tsoumakas, G., Vlahavas, I.: HybridRankSVM: A cost-effective hybrid LtR approach for document ranking. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
4. Chen, J., Chen, S., Song, Y., Liu, H., Wang, Y., Hu, Q., He, L.: ECNU at 2017 eHealth task 2: Technologically assisted reviews in empirical medicine. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
5. Cohen, A.M., Hersh, W.R., Peterson, K., Yen, P.Y.: Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association 13(2), 206–219 (2006)
6. Cormack, G.V., Grossman, M.R.: Engineering quality and reliability in technology-assisted review. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 75–84. SIGIR '16, ACM, New York, NY, USA (2016), http://doi.acm.org/10.1145/2911451.2911510
7. Cormack, G.V., Grossman, M.R.: Technology-assisted review in empirical medicine: Waterloo participation in CLEF eHealth 2017. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
8. Goeuriot, L., Kelly, L., Suominen, H., Névéol, A., Robert, A., Kanoulas, E., Spijker, R., Palotti, J., Zuccon, G.: CLEF 2017 eHealth evaluation lab overview. In: CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS). Springer (September 2017)
9. Hollmann, N., Eickhoff, C.: Relevance-based stopping for recall-centric medical document retrieval. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
10. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20(4), 422–446 (Oct 2002), http://doi.acm.org/10.1145/582415.582418
11. Kalphov, V., Georgiadis, G., Azzopardi, L.: SiS at CLEF 2017 eHealth TAR task. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
12. Lee, G.E.: Medical document classification for systematic reviews using convolutional neural networks: SysReview at CLEF eHealth 2017. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
13. Norman, C., Leeflang, M., Neveol, A.: LIMSI@CLEF eHealth 2017 task 2: Logistic regression for automatic article ranking. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)
14. Nunzio, G.M.D., Beghini, F., Vezzani, F., Henrot, G.: An interactive two-dimensional approach to query aspects rewriting in systematic reviews. IMS Unipd at CLEF eHealth task 2. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)</p>
      <p>Systematic review DOIs:
10.1002/14651858.CD010438.pub2/full
10.1002/14651858.CD010775.pub2/full
10.1002/14651858.CD009175.pub2/full
10.1002/14651858.CD011984/full
10.1002/14651858.CD009786.pub2/full
10.1002/14651858.CD008643.pub2/full
10.1002/14651858.CD009579.pub2/full
10.1002/14651858.CD009925/full
10.1002/14651858.CD009944.pub2/full
10.1002/14651858.CD007431.pub2/full
10.1002/14651858.CD007427.pub2/full
10.1002/14651858.CD008803.pub2/full
10.1002/14651858.CD008122.pub2/full
10.1002/14651858.CD009593.pub3/full
10.1002/14651858.CD008782.pub4/full
10.1002/14651858.CD009647.pub2/full
10.1002/14651858.CD009135.pub2/full
10.1002/14651858.CD008760.pub2/full
10.1002/14651858.CD011549/full
10.1002/14651858.CD009263.pub2/full
10.1002/14651858.CD009519.pub2/full
10.1002/14651858.CD009372.pub2/full
10.1002/14651858.CD011134.pub2/full
10.1002/14651858.CD010079.pub2/full
10.1002/14651858.CD010276.pub2/full
10.1002/14651858.CD008081.pub3/full
10.1002/14651858.CD009185.pub2/full
10.1002/14651858.CD011975/full
10.1002/14651858.CD009323.pub2/full</p>
      <p>[Table 3: evaluation results for the submitted runs without thresholding. Table 4: evaluation results for the submitted runs that applied a threshold.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alharbi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stevenson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Ranking abstracts to identify relevant evidence for systematic reviews: The university of sheffield's approach to clef ehealth 2017 task 2</article-title>
          . In: Working Notes of CLEF 2017 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Dublin, Ireland,
          <source>September 11-14</source>
          ,
          <year>2017</year>
          . CEUR Workshop Proceedings, CEUR-WS.org (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. van Altena,
          <string-name>
            <surname>A.J.:</surname>
          </string-name>
          <article-title>Predicting publication inclusion for diagnostic accuracy test reviews using random forests and topic modelling</article-title>
          .
          <source>In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, CEUR-WS.org (2017)</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          15.
          <string-name>
            <given-names>O</given-names>
            <surname>'Mara-Eves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>McNaught</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Miwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Ananiadou</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.:</surname>
          </string-name>
          <article-title>Using text mining for study identification in systematic reviews: a systematic review of current approaches</article-title>
          .
          <source>Systematic reviews 4(1)</source>
          ,
          <volume>5</volume>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          16.
          <string-name>
            <surname>Scells</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deacon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koopman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Qut ielab at clef 2017 technology assisted reviews track: Initial experiments with learning to rank</article-title>
          . In: Working Notes of CLEF 2017 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Dublin, Ireland,
          <source>September 11-14</source>
          ,
          <year>2017</year>
          . CEUR Workshop Proceedings, CEUR-WS.org (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          17.
          <string-name>
            <surname>Scells</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koopman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deacon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geva</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Azzopardi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A test collection for evaluating retrieval of studies for inclusion in systematic reviews</article-title>
          . In: To appear
          <source>in Proceedings of the 40th international ACM SIGIR conference on Research and development in Information Retrieval. ACM</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          18.
          <string-name>
            <surname>Shemilt</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Use of cost-effectiveness analysis to compare the efficiency of study identification methods in systematic reviews</article-title>
          .
          <source>Systematic Reviews</source>
          <volume>5</volume>
          (
          <issue>1</issue>
          ),
          <volume>140</volume>
          (Aug
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          19.
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marshall</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , Thomas,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          :
          <article-title>Identifying diagnostic test accuracy publications using a deep model</article-title>
          . In: Working Notes of CLEF 2017 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Dublin, Ireland,
          <source>September 11-14</source>
          ,
          <year>2017</year>
          . CEUR Workshop Proceedings, CEUR-WS.org (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          20.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Menzies</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Technologically assisted reviews in empirical medicine: Data balancing or reweighting</article-title>
          . In: Working Notes of CLEF 2017 -
          <article-title>Conference and Labs of the Evaluation forum</article-title>
          , Dublin, Ireland,
          <source>September 11-14</source>
          ,
          <year>2017</year>
          . CEUR Workshop Proceedings, CEUR-WS.org (2017)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>/w rm ty 8 4</source>
          <volume>7 4 7 8 8 0 7 8 8 8 3 6 6 6 7 0 0 0 0 9 5 6 6 7 5 6 1 6 a d</volume>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <volume>3 9 0 8 9 0 3 4 9 6 0 2 6 7 8 4 6 4 4 4 4 6 7 1 3 7 1 8 0 2 te t</volume>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>ts kn leR 1342 3344</source>
          <volume>3099 3155 1972 1873 2678 2441 2404 2382 3727 3727 2503 1877 2068 2333 2305 3124 1464 1161 1469 914 bm i</volume>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>D oF 1 1</source>
          <volume>1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 s</volume>
          <source>t l n 7 4</source>
          <volume>4 3 5 7 7 1 3 7 7 7 0 0 8 1 8 5 5 5 5 3 8 6 6 6 9 9 1 0 g</volume>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>o n e 3 6</source>
          <volume>6 5 6 3 3 5 5 3 3 3 3 3 1 2 1 3 3 3 3 3 2 3 3 3 3 3 4</volume>
          4 in s u
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>