<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">CLEF 2018 Technologically Assisted Reviews in Empirical Medicine Overview</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Evangelos</forename><surname>Kanoulas</surname></persName>
							<email>e.kanoulas@uva.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Informatics Institute</orgName>
								<orgName type="institution">University of Amsterdam</orgName>
								<address>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Dan</forename><surname>Li</surname></persName>
							<email>d.li@uva.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Informatics Institute</orgName>
								<orgName type="institution">University of Amsterdam</orgName>
								<address>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Leif</forename><surname>Azzopardi</surname></persName>
							<email>leif.azzopardi@strath.ac.uk</email>
							<affiliation key="aff1">
								<orgName type="department">Computer and Information Sciences</orgName>
								<orgName type="institution">University of Strathclyde</orgName>
								<address>
									<settlement>Glasgow</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rene</forename><surname>Spijker</surname></persName>
							<affiliation key="aff2">
								<orgName type="department">Julius Center for Health Sciences and Primary Care</orgName>
								<orgName type="institution" key="instit1">Cochrane Netherlands</orgName>
								<orgName type="institution" key="instit2">UMC Utrecht</orgName>
								<address>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">CLEF 2018 Technologically Assisted Reviews in Empirical Medicine Overview</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">866D4DD928DA74E9B6A6ED2FF12F8530</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Systematic Reviews</term>
					<term>Technology Assisted Reviews</term>
					<term>TAR</term>
					<term>Diagnostic Test Accuracy</term>
					<term>DTA</term>
					<term>PubMed</term>
					<term>Cochrane</term>
					<term>e-Health</term>
					<term>Information Retrieval</term>
					<term>Text Classification</term>
					<term>Evaluation</term>
					<term>Test Collection</term>
					<term>Benchmarking</term>
					<term>High Recall</term>
					<term>Active Learning</term>
					<term>Relevance Feedback</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Conducting a systematic review is a widely used method to obtain an overview of the current scientific consensus on a topic of interest, by bringing together multiple studies in a reliable, transparent way. The large and growing number of published studies, and their increasing rate of publication, makes the task of identifying all relevant studies in an unbiased way both complex and time consuming, to the extent that it jeopardizes the validity of review findings and the ability to inform policy and practice in a timely manner. The CLEF 2018 e-Health Technology Assisted Reviews in Empirical Medicine task aims at evaluating search algorithms that seek to identify all studies relevant for conducting a systematic review in empirical medicine. The task had a focus on Diagnostic Test Accuracy (DTA) reviews, and consisted of two subtasks: 1) given a number of relevance criteria as described in a systematic review protocol, search a large medical database of article abstracts (PubMed) to find the studies to be included in the review, and 2) given the article abstracts retrieved by a carefully designed Boolean query, prioritize them to reduce the effort required by experts to screen the abstracts for inclusion in the review. Seven teams participated in the task, with a total of 12 runs submitted for subtask 1 and 19 runs for subtask 2. This paper reports both the methodology used to construct the benchmark collection, and the results of the evaluation.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Evidence-based medicine has become an important pillar in health care and policy making. In order to practice evidence-based medicine, it is important to have a clear overview of the current scientific consensus. These overviews are provided in systematic review articles, which summarize all available evidence regarding a certain topic (e.g., a treatment or a diagnostic test). To write a systematic review, researchers have to conduct a search that will retrieve all the studies that are relevant to the topic. The large and growing number of published studies, and their increasing rate of publication, makes the task of identifying relevant studies in an unbiased way both complex and time consuming, to the extent that it jeopardizes the validity of review findings and the ability to inform policy and practice in a timely manner. Hence, the need for automation in this process becomes of utmost importance. Finding all relevant studies in a corpus is a difficult task, known in the Information Retrieval (IR) domain as the "total recall" problem <ref type="bibr">[7]</ref>.</p><p>To date, the retrieval of studies that contain the necessary evidence to inform systematic reviews is conducted in multiple stages:</p><p>1. Identification: At the first stage a systematic review protocol, which describes the rationale, hypothesis, and planned methods of the review, is prepared. The protocol is used as a guide to carry out the review. Among other information, it provides the criteria that need to be met for a study to be included in the review. Further, a Boolean query that attempts to express these criteria is constructed by an information specialist. The query is then submitted to a medical database containing titles, abstracts, and indexing terms of a controlled vocabulary of medical studies. The result is a set, A, of potentially relevant studies. 2. Screening: At the second stage experts screen the titles and abstracts of the returned set and decide which of them hold potential value for the systematic review, forming a set D. If screening an abstract has a cost C_a, screening all |A| abstracts has a cost of C_a * |A|. 3. Eligibility: At the third stage experts download the full text of the potentially relevant abstracts, D, identified in the previous phase and examine the content to decide whether these studies are indeed relevant or not. Examining a full document typically has a larger cost than examining an abstract, C_d &gt; C_a. The result of this second screening is the set of studies to be included in the systematic review.</p><p>Unfortunately, the precision of the Boolean query is typically low, hence reviewers often need to manually examine many thousands of irrelevant titles and abstracts in order to identify a small number of relevant ones. Further, there is no guarantee that the Boolean query will retrieve all relevant studies, jeopardizing the validity of the reviews. To overcome some of the limitations of Boolean search, researchers have been testing the effectiveness of machine learning and information retrieval methods. O'Mara-Eves et al. <ref type="bibr" target="#b22">[15]</ref> provide a systematic review of the use of text mining techniques for study identification in systematic reviews.</p><p>The focus of the CLEF 2018 e-Health Technology Assisted Reviews in Empirical Medicine (TAR) task, similar to last year <ref type="bibr" target="#b17">[10]</ref>, lies on Diagnostic Test Accuracy (DTA) reviews. 
Search in this area is generally considered the hardest, and a breakthrough in this field would likely be applicable to other areas as well <ref type="bibr" target="#b18">[11]</ref>.</p><p>The goal of the lab is to bring together academic, commercial, and government researchers that will conduct experiments and share results on automatic methods to retrieve relevant studies with high precision and high recall, and release a reusable test collection that can be used as a reference for comparing different retrieval and mining approaches in the field of medical systematic reviews.</p><p>This paper is organized as follows: Section 2 describes the two subtasks of the lab in detail, Section 3 describes the constructed benchmark collection, and Section 4 the evaluation measures used; in Section 5 we briefly describe the participating systems, and in Section 6 we discuss the results of the evaluation. Section 7 concludes the article.</p></div>
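The staged cost model described in the introduction can be made concrete with a small sketch. The per-item costs and set sizes below are purely illustrative assumptions; only the relation C_d &gt; C_a comes from the text.

```python
# Illustrative sketch of the screening cost model described above.
# The per-item costs and the set sizes are hypothetical; only the
# relation C_d > C_a is taken from the text.
def total_screening_cost(a_size, d_size, c_a=1.0, c_d=5.0):
    """Cost of screening all |A| abstracts plus reading all |D| full texts."""
    return a_size * c_a + d_size * c_d

# A Boolean query returning 5,000 abstracts, 200 of which pass title/abstract
# screening and go on to full-text eligibility assessment:
cost = total_screening_cost(5000, 200)
```

Under these assumed unit costs, abstract screening dominates the total effort, which is precisely the stage the ranking methods in this task aim to reduce.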
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Task Description</head><p>In this section we describe the two subtasks of the TAR lab, the input provided to participants for each of the subtasks, and the expected output submitted to the lab for evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Subtask 1: No Boolean Search</head><p>Prior to constructing a Boolean query, researchers have to design and write a systematic review protocol that defines in detail what constitutes a relevant study for their review. In this experimental task of the TAR lab, participants are provided with the relevant pieces of a protocol, in an attempt to complete the search effectively and efficiently while bypassing the construction of the Boolean query.</p><p>In particular, for each systematic review that needs to be conducted (also referred to as a topic in IR terminology), participants are provided with the following input data: 1. topic ID; 2. the title of the review written by Cochrane experts; 3. parts of the protocol, which include the Objective, the Type of Study, the Participants, the Index Tests, the Target Conditions, and the Reference Standards; 4. the PubMed database, provided by the National Center for Biotechnology Information (NCBI), part of the U.S. National Library of Medicine (NLM).</p><p>Participants are provided with 30 topics on Diagnostic Test Accuracy (DTA) reviews. For each of these topics participants are asked to submit: (a) a ranked list of PubMed articles, and (b) a threshold over this ranked list. Participants can submit up to 3 submissions ("runs"). A run is the output of the participants' algorithm for all the topics, in the form of a text file, with each line of the file following the format:</p><formula xml:id="formula_0">TOPIC-ID THRESHOLD PMID RANK SCORE RUN-ID</formula><p>Each line represents a PubMed article in the ranked list for a given topic, with RANK indicating the index of this article in the ranked list. TOPIC-ID is the id of the topic for which the document has been retrieved, and THRESHOLD is either 0 or 1, with 1 indicating that the given rank is the rank of the threshold. 
PMID is the PubMed Document Identifier of the article ranked at that position, SCORE is the score the algorithm gives to the article, and RUN-ID is an identifier for the submitted run. Participants are allowed to submit a maximum of 5,000 ranked PMIDs per topic, i.e. a total maximum of 150,000 lines per run.</p></div>
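The run-file line format above can be sketched as a small writer/parser pair; the field order mirrors the format given in the text, while the concrete values are made up for illustration.

```python
# Minimal sketch (not an official reference implementation) for producing and
# reading lines in the submission format:
#   TOPIC-ID THRESHOLD PMID RANK SCORE RUN-ID
# The example values below are illustrative only.
def format_run_line(topic_id, threshold, pmid, rank, score, run_id):
    return f"{topic_id} {threshold} {pmid} {rank} {score} {run_id}"

def parse_run_line(line):
    topic_id, threshold, pmid, rank, score, run_id = line.split()
    return {"topic": topic_id, "threshold": int(threshold), "pmid": pmid,
            "rank": int(rank), "score": float(score), "run": run_id}

line = format_run_line("CD008122", 1, "18846406", 57, 0.42, "baseline-run")
record = parse_run_line(line)
```

A full run file would simply concatenate such lines for every ranked article of every topic, with THRESHOLD set to 1 on exactly the line marking the chosen cut-off.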
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Subtask 2: Title and Abstract Screening</head><p>Given the results of the Boolean search from the first stage of the systematic review process as the starting point, participants are asked to rank the set of abstracts. The task has two goals: (i) to produce an efficient ordering of the documents, such that all of the relevant abstracts are retrieved as early as possible, and (ii) to identify a subset which contains all or as many as possible of the relevant abstracts for the least effort (i.e. the smallest total number of abstracts to be assessed).</p><p>In particular, for each systematic review that needs to be conducted (also referred to as a topic in IR terminology), participants are provided with the following input data:</p><p>1. topic ID 2. the title of the review written by Cochrane experts; 3. the Boolean query manually constructed by Cochrane experts; 4. the set of PubMed Document Identifiers (PMID's) returned by running the query in MEDLINE.</p><p>Participants are provided with 30 topics on Diagnostic Test Accuracy (DTA) reviews, which are the same topics as those provided in subtask 1. As in subtask 1, participants are asked to submit: (a) a ranked list of the PubMed articles in the given set, and (b) a threshold over this ranked list. Participants can submit up to 3 submissions, and the format of each submission follows the format of subtask 1 submissions. Further, given that subtask 2 was the main task of the CLEF 2017 e-Health Technology Assisted Reviews in Empirical Medicine lab <ref type="bibr" target="#b17">[10]</ref>, participants were allowed, if not encouraged, to also submit the outputs of any of their 2017 systems over the new 30 topics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Benchmark Collection</head><p>In what follows we describe the collection of articles used in the task, the topics released to participants, and how they were developed, as well as the relevance labels used in the evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Articles</head><p>The collection used in the lab is the PubMed Baseline Repository, last updated on November 28, 2017, and available on the NCBI FTP site under the ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline directory. PubMed comprises more than 27 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites. NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. The complete baseline consists of files pubmed18n0001 through pubmed18n0928.</p></div>
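A scripted download of the baseline can enumerate the file names given above; note that the `.xml.gz` suffix is an assumption about NLM's packaging of the baseline files, not something stated in the text.

```python
# Enumerate the 2018 baseline files pubmed18n0001 ... pubmed18n0928 on the
# NCBI FTP site. The .xml.gz suffix is an assumption about the packaging.
BASE_URL = "ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline"

def baseline_file_urls(first=1, last=928):
    return [f"{BASE_URL}/pubmed18n{i:04d}.xml.gz"
            for i in range(first, last + 1)]

urls = baseline_file_urls()
```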
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Topics</head><p>To construct the benchmark collection, the organizers of the task considered 30 systematic reviews on Diagnostic Test Accuracy already conducted by Cochrane researchers. These reviews are publicly available and can be found in the Cochrane Library<ref type="foot" target="#foot_0">4</ref>; they can be identified by setting the topic filter in the library to "Diagnosis" and "Diagnostic Test Accuracy" and the stage filter to "Review".</p><p>At the time of the topic construction 88 such systematic reviews were available; 50 of them were used in the 2017 task <ref type="bibr" target="#b17">[10,</ref><ref type="bibr">6]</ref>, and out of the remaining 38, 30 were chosen to constitute the 2018 topic set. The 30 systematic reviews considered can be found in Tables <ref type="table" target="#tab_9">9 and 10</ref>. The tables provide the topic id, a substring of the DOI of the document (e.g. the DOI for the topic ID CD008122 is 10.1002/14651858.CD008122.pub2), the title of the systematic review that corresponds to the topic, and the publication date.</p><p>Participants were provided with two sets of topics: (a) a development set, and (b) a test set. The development set consisted of 42 topics out of the 50 topics provided in the 2017 version of the lab.<ref type="foot" target="#foot_1">5</ref> The 50 topics released in 2017 were re-examined by our Cochrane information specialist, and co-author of this paper, Rene Spijker, and 8 of them were found unreliable for training or testing purposes, and hence removed from the development set. In particular, the search strategies used within these reviews had a different objective than the objective of the lab. For this task we set out to use searches that are sensitive in nature and that inform a specific question for a single review. Some of the reviews we removed were part of an overarching project where one search query was used to inform multiple reviews. 
We believe that including these would not reflect our intended practice and would misinform the algorithms and strategies developed. Other reviews took a different approach, in which a local registry was built on a broad topic (dementia) to inform the review, and the MEDLINE search was only intended as a highly specific top-up search; again, this is not the intended approach for this task. The reviews themselves were thus reliable, but the search methods used deviated from those targeted by this task, making them unsuitable. The IDs of the 8 topics are the following: CD007431 (10), CD010772 (41), CD010775 (2), CD010896 (39), CD010771 (45), CD011145 (42), CD010783 (56), CD010860 (57), where the number in parentheses is the filename of the topic in the 2017 release of the data.</p><p>Topic Description for Subtask 1: In subtask 1 each topic file was generated through the following procedure: First, the topic ID was extracted from the DOI of the systematic review. Then, the title of the systematic review was considered. Last, for each systematic review, the corresponding protocol was identified, and the objective of the review as described in the protocol was also considered. These three elements, topic ID, title and objective, constitute the topic provided to participants. An example can be seen below: Cochrane DTA review titles follow a particular structure <ref type="bibr">[9]</ref> with a few alternatives. For instance, in the example above the title follows the structure: "[Index test(s)] for [target condition(s)] in [participant description]". The objective of a DTA systematic review can be: (a) to make comparisons between tests concerning their global accuracy, (b) to estimate the accuracy of a test operating at a particular threshold, or (c) to understand why results of studies vary. In the example above the objective is to estimate accuracy. 
Furthermore, participants were provided with other relevant parts of the protocol, in particular the secondary objectives (if any), the type of study, the participants, the index tests, the target conditions, the comparator tests, and the reference standards.</p><p>The description of these relevant parts of the protocol, as given in the Cochrane Handbook for DTA reviews <ref type="bibr">[8]</ref>, can be found in the gray box below.</p><p>Types of studies: Identifiable design features of eligible studies must be stated. Review authors should describe the design as well as using a design name, as there is no universal terminology for diagnostic study designs. Key aspects include whether only prospective or both prospective and retrospective studies are to be included, to describe how and where participants were recruited (e.g. as a consecutive series of new presentations in primary care), and whether the study was cross-sectional or included longitudinal assessment for the reference standard. Authors should always state whether they included or excluded diagnostic case-control studies or the strategy used to make this decision. Any restrictions based on a minimal quality standard, minimal sample sizes, or numbers of diseased cases should be stated, but there is no clear guidance on how these limitations should be determined. In reviews that include comparisons between tests, alternative study designs which make within-study comparisons of tests may be sought, notably studies where all individuals receive all tests, and those where all individuals receive the reference standard but are randomized to receive different index tests. These latter studies should be described as randomized trials of test accuracy. 
Some reviews which compare tests may restrict study inclusion only to studies of these designs which make within-study comparisons, but others may include studies that evaluate one or other of the tests individually (particularly where few such published studies exist). Any such restrictions should be stated. Randomized trials of patient outcomes are rarely eligible for inclusion. They can only be included if individuals received both the index test and a reference standard -occasionally this information is available. Participants: Review authors should specify the participants for whom the test would be applicable, including any restrictions on diagnoses, age groups and settings. Planned subgroup analyses related to participant characteristics should not be listed here -they should be listed under the sources of heterogeneity in the secondary objectives. Index tests: Review authors should specify the test(s) to be evaluated in the review. If multiple tests are being reviewed and compared with each other details for each test should be given. In the first Cochrane DTA protocols and reviews tests were separated into new index tests or existing comparator tests. However it is often difficult to distinguish index from comparator tests and tests are no longer divided into these two categories. However, where it is clear that some tests are new experimental tests and others are existing standard comparative tests this should be noted. Target conditions: The target condition is a particular disease or disease stage that the index test is intended to identify. Some reviews may evaluate the ability of tests to differentiate between several target conditions -if this is the case, the multiple target conditions should all be listed here. Reference standards: Describe the clinical reference standards required to establish the presence or absence of the target condition in the tested population. 
If any reference standards are commonly used but considered inadequate, this should be stated here as an exclusion criterion. If the review covers multiple target conditions, the reference standard for each should be stated.</p><p>Topic Description for Subtask 2: In subtask 2 each topic file was generated through the following procedure: For each systematic review, we reviewed the search strategy from the corresponding study in the Cochrane Library. A search strategy, among other things, consists of the exact Boolean query developed and submitted to a medical database at the time the review was conducted, and typically can be found in the Appendix of the study. Rene Spijker, a co-author of this work and a Cochrane information specialist, examined the grammatical correctness of the search query and specified the date range which dictated the valid dates for the articles to be included in the systematic review. The date range was necessary because a study published after the systematic review should not be included even though it might be relevant, since that would require manually examining its content to quantify its relevance.</p><p>A number of medical databases, and search interfaces to these databases, are available for search, and for each one information specialists construct a different variation of their query that better fits the data and meta-data of the database. For this task, we only considered the Boolean query constructed for the MEDLINE database, using the Wolters Kluwer Ovid interface. We then submitted the constructed Boolean query to the OVID system at http://demo.ovid.com/demo/ovidsptools/launcher.htm and collected all the returned PubMed document identification numbers (PMID's) which satisfied the date range constraint. This step was automated by a Python script, through an interface available to the University of Amsterdam.</p><p>The topic file is in a text format and contains four sections, Topic, Title, Query, and PMID's. 
PMID's are the PubMed document IDs returned by the Boolean query. The PMIDs can be used to access the corresponding document through the National Center for Biotechnology Information (NCBI) <ref type="foot" target="#foot_2">6</ref> . An example of a topic file can be viewed below. </p></div>
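A topic file with the four sections described above could be consumed by a parser along the following lines. This is a hypothetical sketch: the section markers (`Topic:`, `Title:`, `Query:`, `Pids:`) are assumptions, and the files actually released by the lab may use a different layout.

```python
# Hypothetical parser for a subtask 2 topic file with sections Topic, Title,
# Query, and Pids. The exact section markers are assumptions, not the lab's
# official file format.
def parse_topic_file(text):
    topic = {"topic": "", "title": "", "query": [], "pids": []}
    current = None
    for line in text.splitlines():
        if line.startswith("Topic:"):
            topic["topic"] = line.split(":", 1)[1].strip()
        elif line.startswith("Title:"):
            topic["title"] = line.split(":", 1)[1].strip()
        elif line.startswith("Query:"):
            current = "query"     # following lines belong to the Boolean query
        elif line.startswith("Pids:"):
            current = "pids"      # following lines are one PMID each
        elif current is not None and line.strip():
            topic[current].append(line.strip())
    return topic
```

The multi-line Query section is kept as a list of lines because Ovid MEDLINE search strategies are typically written as numbered line-by-line strategies.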
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Relevance Labels</head><p>The original systematic reviews written by Cochrane experts included a reference section that listed Included, Excluded, and Additional references to medical studies. Included are the studies that are relevant to the systematic review.</p><p>Excluded are the studies that in the title and abstract screening stage were considered relevant, but at the article screening phase were considered irrelevant to the study and hence excluded from it. Additional are the studies that do not impact the outcome of the review, and are hence irrelevant to it. The union of the Included and Excluded references are the studies that were screened at a title and abstract level and were considered for further examination at a full content level. These constituted the relevant documents at the abstract level, while the Included references constituted the relevant documents at the full content level. The majority of the references included their corresponding PMID, but not all of them. For those references missing the PMID, the title was extracted from the reference, and it was used as a query to the Google search engine over the domain https://www.ncbi.nlm.nih.gov/pubmed/. The top-scored document returned by Google was selected, and the title of the study contained in the landing page was identified from the extracted metadata. This title was then compared with the title of the study used as the search query. If the edit distance between the two titles was at most 3 (to account for spaces, parentheses, etc.), the study reference was replaced by the PMID, also extracted from the metadata of the landing page. If (a) the title had an edit distance greater than 3 but less than 20, or (b) the study was an included study, or (c) no title was contained in the Google result metadata, or (d) no Google results were returned, then the query was submitted at https://www.ncbi.nlm.nih.gov/pubmed/ and the results were manually examined. All other studies were discarded under the assumption that they are not contained in PubMed. The format of the qrels followed the standard TREC format: Topic Iteration Document Relevance, where Topic is the topic ID of the systematic review, Iteration is in our case a dummy field, always zero and not used, Document is the PMID, and Relevance is a binary label, 0 for not relevant and 1 for relevant studies. The order of documents in the qrel files is not indicative of relevance. Studies that were returned by the Boolean query but were not relevant based on the above process were considered irrelevant. Those are the studies that were excluded at the title and abstract screening phase. All other documents in MEDLINE were also assumed to be irrelevant, given that they were not judged by the human assessor.</p><p>Note that, as mentioned earlier, the references of a systematic review were produced after a number of Boolean queries were submitted to a number of medical databases, and their titles and abstracts were screened. The PMID's provided, however, were only those that came out of the MEDLINE query. Therefore, there was a number of abstract-level relevant studies (the gray area in the Venn diagram below) that were not part of the result set of the Boolean query provided to the participants. Studies that were cited in the systematic review but did not appear in the results of the Boolean query were excluded from the label set for Subtask 2, but included for Subtask 1. Hence, the total number of relevant abstracts in the test set for Subtask 1 is 4,656, while in Subtask 2 it is 3,964; further, the total number of relevant studies in Subtask 1 is 759, while for Subtask 2 it is 678.</p></div>
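The title-matching rule used above for recovering missing PMIDs (accept a match when the edit distance between titles is at most 3) can be sketched with a plain Levenshtein implementation; the threshold comes from the text, while the helper names are ours.

```python
# Sketch of the title-matching rule used for PMID recovery: treat two titles
# as the same study when their Levenshtein edit distance is at most 3.
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def titles_match(title_a, title_b, max_distance=3):
    return edit_distance(title_a, title_b) <= max_distance
```

For example, a title differing from the Google-result title only by a pair of parentheses has distance 2 and is accepted, while substantially different titles fall into the manual-examination branch described above.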
<div xmlns="http://www.tei-c.org/ns/1.0"><head>MEDLINE Boolean Query</head><p>Relevant Studies</p><p>Tables <ref type="table" target="#tab_3">1 and 2</ref> show the distribution of the relevant documents at abstract or document level for all the topics in the development set and the test set. The total number of unique PMID's released for the training set was 241,669 (an average of 5,754 per topic) and for the test set 218,496 (an average of 7,283 per topic). The average percentage of relevant documents at the abstract level in the training set is 3.8% of the total number of PMID's released, and in the test set 4.7%, while at the content level the average percentage is 1.5% in the training set, and 1% in the test set. In <ref type="bibr" target="#b24">[17]</ref>, a test collection was developed based on a random selection of 93 Cochrane systematic reviews (not just DTAs), and a slightly higher rate of relevance was reported (14/1159 = 1.2%). By comparison, the rate of relevant documents is 5.45% for the Adhoc track of TREC 8 and 2.78% for the Web track of TREC 2002. Overall, the number of relevant documents is not very high in this lab, making locating them quite a difficult task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Evaluation</head><p>Evaluation within the context of using technology to assist in the reviewing process is very much dependent on how the users interact with the system, and on the goal of the technology assistance. For example, if the goal of the assistance is to autonomously predict which studies should be assessed by the end-user at a document level, then the problem can be viewed as a classification problem; the system screens all abstracts and returns a subset of them as relevant. If the goal of the assistance is to identify all the relevant documents as quick as possible but let the human decide when to stop screening, then the problem can be viewed as a ranking problem. There are, of course, many other possible variations. For the purposes of the 2018 lab, we consider the problem as a ranking problem -that is, to rank the set of documents associated with the topic in decreasing order of relevance.</p><p>Furthermore, the two subtasks although very similar in terms of evaluation, i.e. in both subtasks participants' runs are rankings of article, with a designated threshold, they also differ: in subtask 2 the set of articles to be prioritized contains all the relevant articles, while in subtask 1 the relevant articles need to be found within the entire PubMed database, and hence there is no guarantee that all relevant articles will appear in the top 5000. Further, in subtask 1, the length of the ranked lists vary significantly across different topics.</p><p>For the evaluation of runs employ a number of standard IR measures, along with measures that have been developed for the particular task of technology assisted reviews <ref type="bibr" target="#b11">[4,</ref><ref type="bibr">2]</ref>. A list of the used measures can be seen below:</p><p>-Subtask 1 1. Average Precision 2. Number of Relevant Found 3. Precision @ last relevant found 4. Recall @ rank k, with k in [50, 100, 200, 500, 1000, 2000, 5000]</p><p>5. 
Recall @ threshold -Subtask 2 1. Average Precision 2. Recall @ k % of top ranked abstracts, with k in <ref type="bibr" target="#b12">[5,</ref><ref type="bibr" target="#b17">10,</ref><ref type="bibr">20,</ref><ref type="bibr">30]</ref> 3. Work Saved over Sampling at recall r, W SS@r = (T N + F N )/N (1 − r)</p><p>[2] 4. Reliability = loss r + loss e <ref type="bibr" target="#b11">[4]</ref>, with loss r = (1 − r) 2 , where r is the recall at the threshold, and loss e = (n/(R + 100) * 100/N ) 2 , where n is the number of returned documents by the system up to the threshold, N is the size of the collection, and R the number of relevant documents.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The lab organizers developed evaluation software similar to trec_eval to ease the evaluation of the submitted runs, and provided it to participants. The code of the tar_eval software is available at https://github.com/CLEF-TAR/tar.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Participants</head><p>The 2018 task received submissions from 7 teams: one team from Canada (UWA), one from the USA (UIC/OHSU), one from the UK (Sheffield), one from China (ECNU), one from Greece (AUTH), one from Italy (UNIPD), and one from France (Limsi-CNRS). The participating teams are listed below. For subtask 1 we received 12 runs from 4 teams, and for subtask 2 we received 19 runs from 7 teams.</p><p>The 7 teams used a variety of learning methods, including batch supervised learning and continuous active learning; a variety of learning algorithms, including logistic regression, support vector machines, and neural networks, as well as unsupervised retrieval methods such as TF-IDF and BM25, with or without traditional relevance feedback methods such as Rocchio's algorithm; and a variety of text representation methods, including simple count-based methods and neural embeddings.</p><p>Tables <ref type="table" target="#tab_5">3 and 4</ref> categorize the participating runs in the two subtasks along five dimensions: (a) automatic vs. manual runs; (b) use of the development set; (c) use of supervised and semi-supervised learning algorithms; (d) use of relevance feedback; and (e) the type of thresholding used. The categorization has been performed by the lab coordinators, not by the participants, based on the participants' submitted descriptions of their algorithms; hence, there is always a chance that some run is misclassified. In subtask 1 participants employed both supervised and unsupervised methods for ranking articles. A total of 5 runs were trained over the provided development set, and their generalization was tested against the test topics, while 7 made no explicit use of it. It may be the case that participants tried different models and algorithms over the development set and chose to submit the best-performing ones, so some model selection may be involved; however, we did not count this as use of the development set. Participants represented the textual data in a variety of ways, including document-topic features, bag-of-words, topic model distributions, embeddings, and metadata. 
Out of the 19 runs submitted for subtask 2, 6 were trained on the development set and 12 used the relevance feedback provided per topic, either at the abstract or the content level, while 6 runs used a fixed threshold, 2 an automatic thresholding method, and the rest did not threshold the ranking at all. Below we provide a short description of the submitted runs for subtasks 1 and 2.</p><p>AUTH took a learning-to-rank approach, using both batch and active learning. Their model consists of two parts: an inter-topic model, which utilizes XGBoost and is trained over the entire development corpus (for subtask 1 this is 2500 articles returned by the PubMed search, and for subtask 2 the articles provided by the organizers), and an intra-topic model, an iteratively built SVM, trained on the relevance feedback provided within the test topics. For the inter-topic model, a total of 48 (for subtask 1) and 31 (for subtask 2) topic-document (or topic-only) features were computed over the title and abstract of the articles and the query. For the intra-topic model, a TF-IDF vectorization of the articles was used <ref type="bibr">[12]</ref>.</p><p>CNRS trained a logistic regression model with a large number (&gt; 500,000) of features over the development set. The logistic regression model is intended to capture features related to DTA studies independently of the topic. They further used an active learning approach which continuously learns to find relevant articles within each topic. A model that combines the two using a feedforward neural network was also used <ref type="bibr" target="#b20">[13]</ref>.</p><p>ECNU used the BM25 algorithm for subtask 1 to obtain a baseline. Furthermore, query expansion based on MeSH terms and pseudo relevance feedback (PRF) was used to improve the results. In subtask 2, they employed Paragraph2Vector to represent queries and documents for similarity calculation <ref type="bibr" target="#b25">[18]</ref>. 
UIC/OHSU first applied a clustering algorithm over a large number of PubMed articles to identify 6 publication types: DTA studies, but also Randomized Controlled Trials, Cross-sectional Studies, Cross-over Studies, Cohort Studies, and Case-Control Studies. Each cluster was then represented by the feature vector of its centroid, with each article in the cluster represented by the 300 weighted terms most associated with the words in the article. Then, each article in the provided dataset was compared to the 6 clusters and a number of similarity measures were computed. These similarities were in turn used as features by an SVM to classify articles against the 6 clusters <ref type="bibr">[3]</ref>.</p><p>UNIPD used a two-dimensional probabilistic version of BM25 to rank articles, using relevance feedback up to a certain number of articles shown to the user, and then switched to a Naive Bayes classifier for the remainder of the articles, up to a fixed threshold point <ref type="bibr" target="#b21">[14]</ref>.</p><p>Sheffield used the RAKE <ref type="bibr" target="#b23">[16]</ref> keyword extraction algorithm for subtask 1 to interpret protocols, extract keywords, and form them into queries designed to retrieve relevant documents, with Apache Lucene used as the IR engine. Their approach to subtask 2 was to enrich queries with terms designed to identify diagnostic test accuracy studies, and also to make use of relevance feedback <ref type="bibr">[1]</ref>.</p><p>UWA applied the Baseline Model Implementation (BMI) from the TREC Total Recall track (2015-2016) and the CLEF 2017 eHealth lab. They further applied their "knee-method" stopping criterion to BMI to determine how many abstracts should be examined for each topic. The difference between submissions came from the selection of feedback used to retrain the model, with the options being abstract-level, content-level, or manual feedback provided by the participants themselves <ref type="bibr" target="#b12">[5]</ref>.</p></div>
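Several of the runs above (notably the BMI-based ones) follow the same continuous active learning template: train a model on the judgments gathered so far, score the unjudged articles, have the reviewer screen the top-scoring batch, and retrain with the new feedback. The sketch below illustrates that loop only; it is not any team's actual code. It assumes scikit-learn, uses a `labels` dict to simulate reviewer feedback, and all names, the seeding, and the batch size are illustrative (the seed must contain both a relevant and a non-relevant example for the classifier to train).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_ranking(docs, labels, seed, batch_size=100):
    """Return the order in which documents get screened.

    docs:   list of abstracts (strings)
    labels: dict index -> 0/1, simulating the reviewer's feedback
    seed:   indices judged up front (must cover both classes here)
    """
    X = TfidfVectorizer().fit_transform(docs)
    judged = list(seed)  # screening order so far
    while len(judged) < len(docs):
        # retrain on everything judged so far
        model = LogisticRegression(max_iter=1000)
        model.fit(X[judged], [labels[i] for i in judged])
        rest = [i for i in range(len(docs)) if i not in judged]
        scores = model.predict_proba(X[rest])[:, 1]  # P(relevant)
        # the reviewer screens the next highest-scoring batch,
        # and those judgments feed the next training round
        for j in np.argsort(-scores)[:batch_size]:
            judged.append(rest[j])
    return judged
```

In a real deployment the `labels` lookups inside the loop are replaced by actual reviewer decisions, and the loop stops at a threshold (fixed, or automatic as in the knee method) rather than after exhausting the collection.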
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Results</head><p>In this section we provide the results of the evaluation for both subtasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Subtask 1: No Boolean Search</head><p>Tables 5 and 6 provide the results of the evaluation for subtask 1 for a subset of the evaluation measures. All participants' runs are evaluated against both the document-level and the abstract-level relevance labels. What is impressive in these results is that, without any manual effort to construct a Boolean query (a rather time-consuming and error-prone process), the best system achieves 96.7% recall, missing only 25 Included studies out of all 759.</p><p>Figure <ref type="figure" target="#fig_2">1</ref> shows the box plots for Average Precision against the document-level labels for each of the participants' runs in Subtask 1, with the Mean Average Precision denoted by a blue dashed line in the box plot. Table <ref type="table">5</ref>. Average scores for the submitted runs in Subtask 1; relevance is considered at the document level, i.e. only Included studies are considered relevant. In total there are 759 studies that are Included in the 30 systematic reviews conducted.</p></div>
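The (Mean) Average Precision reported in these tables can be computed per topic as follows; this is a minimal sketch of the standard definition (the helper name is ours, and this is not the tar_eval code itself), with AP normalized by the total number of relevant documents for the topic so that unretrieved relevant documents are penalized.

```python
def average_precision(ranking, relevant):
    """AP over a ranked list of doc ids for one topic.

    ranking:  list of doc ids in ranked order
    relevant: set of relevant doc ids for the topic
    """
    hits, ap_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            ap_sum += hits / rank  # precision at this relevant document
    return ap_sum / len(relevant) if relevant else 0.0
```

For example, a topic with two relevant documents retrieved at ranks 1 and 3 scores (1/1 + 2/3)/2 ≈ 0.833; MAP is then the mean of these per-topic values.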
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Run</head><p>Total Rel P@ MAP R@50 R@100 R@200 R@300 R@400 R@500 R@1000 R@2000 R@5000 R@k k</p><p>Figure <ref type="figure" target="#fig_3">2</ref> shows the recall-effort curves for the participants' runs, that is, the recall achieved at different percentages of documents shown to the user. The green curve with the square marker corresponds to the Oracle run, which achieves the optimal recall at each effort level.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusions</head><p>The CLEF 2018 e-Health Lab Task 2 constructed a benchmark collection of 30 Diagnostic Test Accuracy systematic reviews to study the effectiveness and efficiency of information retrieval and machine learning algorithms, both in finding relevant articles in a large medical database without explicitly constructing a Boolean query, and in prioritizing the studies to be screened at the title and abstract screening stage while providing a stopping criterion over the ranked list. The results demonstrate that automatic methods can be trusted to find most, if not all, relevant studies in a fraction of the time that manual screening requires. Further, many of the algorithms retrieved articles that were not in the results of the Boolean query, raising concerns about the validity of the current practice in conducting systematic reviews. Given that many parameters change simultaneously across different runs, it is not easy to reach firm conclusions about the relative performance of the automatic methods.</p><p>Regarding the benchmark collection itself, there is a number of limitations to be considered: (a) Pivoting on the results of the OVID MEDLINE Boolean query limits our ability to identify all relevant studies, i.e. relevant studies that are returned by Boolean queries over different databases, and relevant studies that are not found by these Boolean queries at all. The former can be overcome by considering all the different queries submitted; for the latter, extra manual judgments would be required. (b) Pivoting on title and abstract only, we miss the opportunity to study the effect of automatic methods when applied to the full text of the studies, which would present an opportunity to completely overcome the multi-stage process of systematic reviews. However, most of the full-text articles are protected under copyright laws that do not give all participants access to them. 
(c) The evaluation setup of ranking does not allow us to consider the cost of the process, since, given a ranking, a researcher would still have to go over all ranked studies. A more realistic setup, e.g. a double-screening setup, could be considered. (d) In the construction of the relevance judgments we considered the included and excluded references of the systematic reviews under study, which prevented us from studying the noise and disagreement between reviewers. (e) In our effort to allow iterative algorithms, e.g. active learning algorithms, to be submitted, we handed the test sets' relevance judgments directly to the participants, which is rather unusual for this type of evaluation exercise. An alternative would be the setup used by the TREC Total Recall track, where participants submitted their running algorithms to the organizers. (f) When it comes to evaluation measures, there is a large variety of them, each taking a different, often useful, viewpoint on the effectiveness of an algorithm, which makes it difficult to decide upon a single golden measure by which to rank participants' runs. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>1. Aristotle University of Thessaloniki, Greece (AUTH) 2. Centre National de la Recherche Scientifique, France &amp; Amsterdam Medical Center, The Netherlands (CNRS) 3. East China Normal University, China (ECNU) 4. University of Illinois College of Medicine, Chicago, Illinois, USA and Oregon Health &amp; Science University, Portland, Oregon, USA (UIC/OHSU) 5. University of Padua, Italy (UNIPD) 6. University of Sheffield, United Kingdom (Sheffield) 7. University of Waterloo, Canada (UWA)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Average precision using the document level relevance judgments.</figDesc><graphic coords="18,134.77,214.80,345.83,326.49" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Recall at different top-k percentages of shown abstracts. Recall is computed using the abstract level relevance labels.</figDesc><graphic coords="21,134.77,197.48,345.83,298.62" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3</head><label>3</label><figDesc>Figure 3 presents the recall obtained by the participants' runs at the point of the threshold, as a function of the number of documents presented to the user. As expected, the more documents presented to the user (the lower the threshold), the higher the achieved recall. Nevertheless, there are still algorithms that dominate others. The figure presents the Pareto frontier.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4</head><label>4</label><figDesc>Figure 4 demonstrates the bar plot of average precision values per topic; the dashed blue line in the box plots designates the average Average Precision (AAP) for each topic, a measure that can be seen as a proxy for topic difficulty.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Recall at the threshold rank as a function of the number of documents shown to the user.</figDesc><graphic coords="22,134.77,229.69,345.82,287.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. Average Average Precision at document level relevance labels.</figDesc><graphic coords="23,134.77,216.79,345.83,322.51" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Fig. 6 .</head><label>6</label><figDesc>Fig. 6. Recall at different ranks.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 7</head><label>7</label><figDesc>Figure 7 presents the recall obtained by the participants' runs at the point of the threshold, as a function of the number of abstracts presented to the user. As expected, the more abstracts presented to the user (the lower the threshold), the higher the achieved recall. Nevertheless, there are still algorithms that dominate others. The figure presents the Pareto frontier. Figure 8 demonstrates the bar plot of average precision values per topic; the dashed blue line in the box plots designates the average Average Precision (AAP) for each topic, a measure that can be seen as a proxy for topic difficulty.</figDesc><graphic coords="27,134.77,161.55,345.83,298.62" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Fig. 7 .</head><label>7</label><figDesc>Fig. 7. Recall at the threshold rank as a function of the number of abstracts shown to the user.</figDesc><graphic coords="28,134.77,230.59,345.82,285.69" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_11"><head>Fig. 8 .</head><label>8</label><figDesc>Fig. 8. Average Average Precision at abstract level relevance labels.</figDesc><graphic coords="29,134.77,216.79,345.83,322.51" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="24,134.77,265.97,345.83,335.88" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1 .</head><label>1</label><figDesc>Statistics of topics in the development set. The total PMIDs are the ones retrieved by the Boolean Query, with the percentage of relevant articles also computed over this retrieved set.</figDesc><table><row><cell cols="6">Topic # total PMIDs # abs rel # doc rel % abs rel % doc rel</cell></row><row><cell></cell><cell></cell><cell cols="2">Development Set</cell><cell></cell><cell></cell></row><row><cell>CD010438</cell><cell>3250</cell><cell>39</cell><cell>3</cell><cell>1.20</cell><cell>0.09</cell></row><row><cell>CD007427</cell><cell>1521</cell><cell>123</cell><cell>17</cell><cell>8.09</cell><cell>1.12</cell></row><row><cell>CD009593</cell><cell>14922</cell><cell>78</cell><cell>24</cell><cell>0.52</cell><cell>0.16</cell></row><row><cell>CD011549</cell><cell>12705</cell><cell>2</cell><cell>1</cell><cell>0.02</cell><cell>0.01</cell></row><row><cell>CD011134</cell><cell>1953</cell><cell>215</cell><cell>49</cell><cell>11.01</cell><cell>2.51</cell></row><row><cell>CD008686</cell><cell>3966</cell><cell>7</cell><cell>5</cell><cell>0.18</cell><cell>0.13</cell></row><row><cell>CD011975</cell><cell>8201</cell><cell>619</cell><cell>60</cell><cell>7.55</cell><cell>0.73</cell></row><row><cell>CD009323</cell><cell>3881</cell><cell>122</cell><cell>9</cell><cell>3.14</cell><cell>0.23</cell></row><row><cell>CD009020</cell><cell>1584</cell><cell>162</cell><cell>12</cell><cell>10.23</cell><cell>0.76</cell></row><row><cell>CD011548</cell><cell>12708</cell><cell>113</cell><cell>5</cell><cell>0.89</cell><cell>0.04</cell></row><row><cell>CD011984</cell><cell>8192</cell><cell>454</cell><cell>28</cell><cell>5.54</cell><cell>0.34</cell></row><row><cell>CD010409</cell><cell>43363</cell><cell>76</cell><cell>41</cell><cell>0.18</cell><cell>0.09</cell></row><row><cell>CD008054</cell><cell>3217</cell><cell>274</cell><cell>41</cell><cell>8.52</cell><cell>1.27</cell></row><row><cell>
CD009591</cell><cell>7991</cell><cell>144</cell><cell>41</cell><cell>1.80</cell><cell>0.51</cell></row><row><cell>CD008691</cell><cell>1316</cell><cell>73</cell><cell>20</cell><cell>5.55</cell><cell>1.52</cell></row><row><cell>CD010632</cell><cell>1504</cell><cell>32</cell><cell>14</cell><cell>2.13</cell><cell>0.93</cell></row><row><cell>CD007394</cell><cell>2545</cell><cell>95</cell><cell>47</cell><cell>3.73</cell><cell>1.85</cell></row><row><cell>CD008643</cell><cell>15083</cell><cell>11</cell><cell>4</cell><cell>0.07</cell><cell>0.03</cell></row><row><cell>CD009944</cell><cell>1181</cell><cell>117</cell><cell>64</cell><cell>9.91</cell><cell>5.42</cell></row><row><cell>CD008803</cell><cell>5220</cell><cell>99</cell><cell>99</cell><cell>1.90</cell><cell>1.90</cell></row><row><cell>CD008782</cell><cell>10507</cell><cell>45</cell><cell>34</cell><cell>0.43</cell><cell>0.32</cell></row><row><cell>CD009647</cell><cell>2785</cell><cell>56</cell><cell>17</cell><cell>2.01</cell><cell>0.61</cell></row><row><cell>CD009135</cell><cell>791</cell><cell>77</cell><cell>19</cell><cell>9.73</cell><cell>2.40</cell></row><row><cell>CD008760</cell><cell>64</cell><cell>12</cell><cell>9</cell><cell>18.75</cell><cell>14.06</cell></row><row><cell>CD009519</cell><cell>5971</cell><cell>104</cell><cell>46</cell><cell>1.74</cell><cell>0.77</cell></row><row><cell>CD009372</cell><cell>2248</cell><cell>25</cell><cell>10</cell><cell>1.11</cell><cell>0.44</cell></row><row><cell>CD010276</cell><cell>5495</cell><cell>54</cell><cell>24</cell><cell>0.98</cell><cell>0.44</cell></row><row><cell>CD009551</cell><cell>1911</cell><cell>46</cell><cell>16</cell><cell>2.41</cell><cell>0.84</cell></row><row><cell>CD012019</cell><cell>10317</cell><cell>3</cell><cell>1</cell><cell>0.03</cell><cell>0.01</cell></row><row><cell>CD008081</cell><cell>970</cell><cell>26</cell><cell>10</cell><cell>2.68</cell><cell>1.03</cell></row><row><cell>CD009185</cell><cell>1615</cell><cell>92</cell><cell>23</cell><cell>5.70</cell>
<cell>1.42</cell></row><row><cell>CD010339</cell><cell>12807</cell><cell>114</cell><cell>9</cell><cell>0.89</cell><cell>0.07</cell></row><row><cell>CD010653</cell><cell>8002</cell><cell>45</cell><cell>0</cell><cell>0.56</cell><cell>0.00</cell></row><row><cell>CD010542</cell><cell>348</cell><cell>20</cell><cell>8</cell><cell>5.75</cell><cell>2.30</cell></row><row><cell>CD010023</cell><cell>981</cell><cell>52</cell><cell>14</cell><cell>5.30</cell><cell>1.43</cell></row><row><cell>CD010705</cell><cell>114</cell><cell>23</cell><cell>18</cell><cell>20.18</cell><cell>15.79</cell></row><row><cell>CD010633</cell><cell>1573</cell><cell>4</cell><cell>3</cell><cell>0.25</cell><cell>0.19</cell></row><row><cell>CD010173</cell><cell>5495</cell><cell>23</cell><cell>10</cell><cell>0.42</cell><cell>0.18</cell></row><row><cell>CD009786</cell><cell>2065</cell><cell>10</cell><cell>6</cell><cell>0.48</cell><cell>0.29</cell></row><row><cell>CD010386</cell><cell>626</cell><cell>2</cell><cell>1</cell><cell>0.32</cell><cell>0.16</cell></row><row><cell>CD009579</cell><cell>6455</cell><cell>138</cell><cell>79</cell><cell>2.14</cell><cell>1.22</cell></row><row><cell>CD009925</cell><cell>6531</cell><cell>460</cell><cell>55</cell><cell>7.04</cell><cell>0.84</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2 .</head><label>2</label><figDesc>Statistics of topics in the test set. The total PMIDs are the ones retrieved by the Boolean Query, with the percentage of relevant articles also computed over this retrieved set.</figDesc><table><row><cell>Table 1 and Table</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3 .</head><label>3</label><figDesc>Categorization of participants' runs in subtask 1 along four dimensions.</figDesc><table><row><cell></cell><cell></cell><cell cols="2">Subtask 1: No Boolean Search</cell><cell></cell></row><row><cell>Run</cell><cell cols="5">Automatic Development Supervision Feedback Threshold</cell></row><row><cell>auth_run1</cell><cell></cell><cell></cell><cell></cell><cell>content</cell><cell>fixed</cell></row><row><cell>auth_run2</cell><cell></cell><cell></cell><cell></cell><cell>content</cell><cell>fixed</cell></row><row><cell>auth_run3</cell><cell></cell><cell></cell><cell></cell><cell>content</cell><cell>fixed</cell></row><row><cell>ECNU_RUN1</cell><cell></cell><cell>x</cell><cell>x</cell><cell>x</cell><cell>x</cell></row><row><cell>ECNU_RUN2</cell><cell></cell><cell></cell><cell></cell><cell>x</cell><cell>x</cell></row><row><cell>ECNU_RUN3</cell><cell></cell><cell></cell><cell></cell><cell>x</cell><cell>x</cell></row><row><cell>shef-bm25</cell><cell></cell><cell>x</cell><cell>x</cell><cell>x</cell><cell>x</cell></row><row><cell>shef-tfidf</cell><cell></cell><cell>x</cell><cell>x</cell><cell>x</cell><cell>x</cell></row><row><cell>shef-bool</cell><cell></cell><cell>x</cell><cell>x</cell><cell>x</cell><cell>x</cell></row><row><cell>UWA</cell><cell></cell><cell>x</cell><cell></cell><cell>abs</cell><cell>auto</cell></row><row><cell>UWG</cell><cell>x</cell><cell>x</cell><cell></cell><cell>manual</cell><cell>auto</cell></row><row><cell>UWX</cell><cell>x</cell><cell>x</cell><cell cols="2">manual &amp; abs</cell><cell>auto</cell></row><row><cell></cell><cell cols="4">Subtask 2: Title and Abstract Screening</cell></row><row><cell>Run</cell><cell cols="5">Automatic Development Supervision Feedback 
Threshold</cell></row><row><cell>auth_run1</cell><cell></cell><cell></cell><cell></cell><cell>content</cell><cell>fixed</cell></row><row><cell>auth_run2</cell><cell></cell><cell></cell><cell></cell><cell>content</cell><cell>fixed</cell></row><row><cell>auth_run3</cell><cell></cell><cell></cell><cell></cell><cell>content</cell><cell>fixed</cell></row><row><cell>cnrs_RF_uni</cell><cell></cell><cell>x</cell><cell cols="2">abs &amp; content</cell><cell>x</cell></row><row><cell>cnrs_RF_bi</cell><cell></cell><cell>x</cell><cell cols="2">abs &amp; content</cell><cell>x</cell></row><row><cell>cnrs_comb</cell><cell></cell><cell></cell><cell cols="2">abs &amp; content</cell><cell>x</cell></row><row><cell>ECNU_RUN1</cell><cell></cell><cell>x</cell><cell>x</cell><cell>x</cell><cell>x</cell></row><row><cell>ECNU_RUN2</cell><cell></cell><cell></cell><cell></cell><cell>x</cell><cell>x</cell></row><row><cell>ECNU_RUN3</cell><cell></cell><cell></cell><cell></cell><cell>x</cell><cell>x</cell></row><row><cell>unipd_t500</cell><cell></cell><cell>x</cell><cell></cell><cell>abs</cell><cell>fixed</cell></row><row><cell>unipd_t1000</cell><cell></cell><cell>x</cell><cell></cell><cell>abs</cell><cell>fixed</cell></row><row><cell>unipd_t1500</cell><cell></cell><cell>x</cell><cell></cell><cell>abs</cell><cell>fixed</cell></row><row><cell>shef-feed</cell><cell></cell><cell>x</cell><cell>x</cell><cell>abs</cell><cell>x</cell></row><row><cell>shef-general</cell><cell></cell><cell>x</cell><cell>x</cell><cell>x</cell><cell>x</cell></row><row><cell>shef-query</cell><cell></cell><cell>x</cell><cell>x</cell><cell>x</cell><cell>x</cell></row><row><cell>uic_model7</cell><cell></cell><cell>x</cell><cell></cell><cell>x</cell><cell>x</cell></row><row><cell>uic_model8</cell><cell></cell><cell>x</cell><cell></cell><cell>x</cell><cell>x</cell></row><row><cell>UWA</cell><cell></cell><cell>x</cell><cell></cell><cell>abs</cell><cell>auto</cell></row><row><cell>UWB</cell><cell></cell><cell>x</cell><cell 
cols="2">abs &amp; content</cell><cell>auto</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4 .</head><label>4</label><figDesc>Categorization of participants' runs in subtask 2 along four dimensions.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 6 .</head><label>6</label><figDesc>Average scores for the submitted runs in Subtask 1; relevance is considered at the abstract level, i.e. both Included and Excluded studies are considered relevant. In total there are 4656 studies that are identified as potential relevant during the title and abstract screening in the 30 systematic reviews conducted.</figDesc><table><row><cell></cell><cell cols="2">Rel Found Last</cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell></cell><cell>Rel</cell><cell></cell><cell></cell><cell></cell></row><row><cell>auth_run1</cell><cell>759</cell><cell>619 0.217 0.113 0.188 0.341 0.510 0.610 0.660 0.693</cell><cell>0.787</cell><cell>0.802</cell><cell>0.816 0.816 5000</cell></row><row><cell>auth_run2</cell><cell>759</cell><cell>619 0.217 0.113 0.188 0.341 0.510 0.610 0.660 0.693</cell><cell>0.787</cell><cell>0.802</cell><cell>0.816 0.809 2500</cell></row><row><cell>auth_run3</cell><cell>759</cell><cell>619 0.217 0.113 0.188 0.341 0.510 0.610 0.660 0.693</cell><cell>0.787</cell><cell>0.802</cell><cell>0.816 0.787 1000</cell></row><row><cell cols="2">ECNU_RUN1 759</cell><cell>426 0.118 0.072 0.170 0.242 0.339 0.393 0.431 0.472</cell><cell>0.561</cell><cell>0.561</cell><cell>0.561 0.472 500</cell></row><row><cell cols="2">ECNU_RUN2 759</cell><cell>310 0.080 0.041 0.076 0.145 0.216 0.281 0.340 0.378</cell><cell>0.408</cell><cell>0.408</cell><cell>0.408 0.378 500</cell></row><row><cell cols="2">ECNU_RUN3 759</cell><cell>426 0.109 0.072 0.173 0.246 0.341 0.411 0.452 0.485</cell><cell>0.561</cell><cell>0.561</cell><cell>0.561 0.485 500</cell></row><row><cell>shef-bm25</cell><cell>759</cell><cell>323 0.443 0.026 0.045 0.063 0.108 0.149 0.169 0.187</cell><cell>0.261</cell><cell>0.315</cell><cell>0.426 0.426 5000</cell></row><row><cell>shef-tfidf</cell><cell>759</cell><cell>202 0.523 0.002 0.005 0.005 0.017 0.029 0.042 
0.057</cell><cell>0.086</cell><cell>0.126</cell><cell>0.266 0.266 5000</cell></row><row><cell>shef-bool</cell><cell>759</cell><cell>227 0.467 0.008 0.022 0.049 0.069 0.097 0.111 0.124</cell><cell>0.170</cell><cell>0.221</cell><cell>0.299 0.299 5000</cell></row><row><cell>UWA</cell><cell>759</cell><cell>727 0.225 0.124 0.256 0.428 0.592 0.693 0.771 0.806</cell><cell>0.912</cell><cell>0.947</cell><cell>0.958 0.951 3559</cell></row><row><cell>UWG</cell><cell>759</cell><cell>734 0.239 0.080 0.121 0.273 0.462 0.590 0.675 0.729</cell><cell>0.883</cell><cell>0.959</cell><cell>0.967 0.962 3611</cell></row><row><cell>UWX</cell><cell>759</cell><cell>727 0.221 0.154 0.254 0.386 0.564 0.673 0.743 0.784</cell><cell>0.884</cell><cell>0.950</cell><cell>0.958 0.951 3613</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 9 .</head><label>9</label><figDesc>The set of testing topics provided to participants (PART I).</figDesc><table><row><cell>Topic ID</cell><cell>Topic Title</cell><cell>Publication</cell></row><row><cell></cell><cell></cell><cell>Date</cell></row><row><cell>CD008122</cell><cell>Rapid diagnostic tests for diagnosing uncomplicated P.</cell><cell>2010/01/14</cell></row><row><cell></cell><cell>falciparum malaria in endemic countries</cell><cell></cell></row><row><cell>CD012599</cell><cell>First and second trimester serum tests with and with-</cell><cell>2011/08/25</cell></row><row><cell></cell><cell>out first trimester ultrasound tests for Down's syndrome</cell><cell></cell></row><row><cell></cell><cell>screening</cell><cell></cell></row><row><cell>CD009175</cell><cell>Clinical symptoms and signs for the diagnosis of My-</cell><cell>2012/06/26</cell></row><row><cell></cell><cell>coplasma pneumoniae in children and adolescents with</cell><cell></cell></row><row><cell></cell><cell>community-acquired pneumonia</cell><cell></cell></row><row><cell>CD009694</cell><cell>Computed tomography (CT) angiography for confirma-</cell><cell>2012/08/31</cell></row><row><cell></cell><cell>tion of the clinical diagnosis of brain death</cell><cell></cell></row><row><cell>CD009263</cell><cell>123I-MIBG scintigraphy and 18F-FDG-PET imaging for</cell><cell>2012/09/21</cell></row><row><cell></cell><cell>diagnosing neuroblastoma</cell><cell></cell></row><row><cell>CD010502</cell><cell>Rapid antigen detection test for group A streptococcus</cell><cell>2013/02/01</cell></row><row><cell></cell><cell>in children with pharyngitis</cell><cell></cell></row><row><cell>CD010680</cell><cell>Ankle brachial index for the diagnosis of lower limb pe-</cell><cell>2013/02/01</cell></row><row><cell></cell><cell>ripheral arterial disease</cell><cell></cell></row><row><cell>CD010864</cell><cell>D-dimer test for excluding the
diagnosis of pulmonary</cell><cell>2013/12/12</cell></row><row><cell></cell><cell>embolism</cell><cell></cell></row><row><cell>CD011431</cell><cell>Rapid diagnostic tests for diagnosing uncomplicated non-</cell><cell>2013/12/31</cell></row><row><cell></cell><cell>falciparum or Plasmodium vivax malaria in endemic</cell><cell></cell></row><row><cell></cell><cell>countries</cell><cell></cell></row><row><cell>CD011602</cell><cell>Ultrasonography for diagnosis of alcoholic cirrhosis in</cell><cell>2015/01/31</cell></row><row><cell></cell><cell>people with alcoholic liver disease</cell><cell></cell></row><row><cell>CD011420</cell><cell>Lateral flow urine lipoarabinomannan assay for detecting</cell><cell>2015/02/28</cell></row><row><cell></cell><cell>active tuberculosis in HIV-positive adults</cell><cell></cell></row><row><cell>CD011686</cell><cell>Triage tools for detecting cervical spine injury in pedi-</cell><cell>2015/02/28</cell></row><row><cell></cell><cell>atric trauma patients</cell><cell></cell></row><row><cell>CD012179</cell><cell>Blood biomarkers for the non-invasive diagnosis of en-</cell><cell>2015/05/01</cell></row><row><cell></cell><cell>dometriosis</cell><cell></cell></row><row><cell>CD012281</cell><cell>Combination of the non-invasive tests for the diagnosis</cell><cell>2015/05/31</cell></row><row><cell></cell><cell>of endometriosis</cell><cell></cell></row><row><cell>CD011053</cell><cell>Imaging for the exclusion of pulmonary embolism in preg-</cell><cell>2015/07/28</cell></row><row><cell></cell><cell>nancy</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 10.</head><label>10</label><figDesc>The set of testing topics provided to participants (PART II).</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0">http://www.cochranelibrary.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1">For subtask 1, two topics, CD011548 and CD011984, were not provided to participants, resulting in 40 training topics.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2">https://www.ncbi.nlm.nih.gov/books/NBK25497/</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Subtask 2: Title and Abstract Screening</head><p>Tables 7 and 8 provide the results of the evaluation for subtask 2 for a subset of the evaluation measures. All participants' runs are evaluated against both the document-level and the abstract-level relevance labels. As one can observe, the best run achieves a recall of 99.4% after reviewing 30% of the abstracts, i.e. it misses only 4 out of 678 Included studies, while after reviewing only 10% of the abstracts the best run still achieves a recall of 90.6%.</p><p>Figure <ref type="figure">5</ref> shows the box plots of Average Precision against the abstract-level labels for each of the participants' runs in Subtask 2, with the Mean Average Precision denoted by a blue dashed line in each box plot. Table <ref type="table">7</ref>. Average scores for the submitted runs in Subtask 2; relevance is considered at the document level, i.e. only Included studies are considered relevant. In total, 759 studies are Included across the 30 systematic reviews conducted.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Run</head><p>Total Avg MAP R@5% R@10% R@20% R@30% WSS95 WSS100 </p></div>			</div>
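The measures reported in the tables above (MAP, R@k%, WSS95, WSS100) follow the usual technology-assisted-review conventions: recall after screening a fixed fraction of the ranking, Average Precision over the Included studies, and Work Saved over Sampling at a target recall (Cohen et al., 2006). A minimal Python sketch of how they can be computed from a ranked list of study IDs; the function names are illustrative and these are not the official task evaluation scripts:

```python
def recall_at(ranking, relevant, frac):
    """Recall after screening the top `frac` fraction of the ranked list."""
    k = int(len(ranking) * frac)
    found = sum(1 for doc in ranking[:k] if doc in relevant)
    return found / len(relevant)

def average_precision(ranking, relevant):
    """Average Precision over the relevant (Included) studies."""
    found, ap = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            ap += found / i
    return ap / len(relevant)

def wss_at(ranking, relevant, target=0.95):
    """Work Saved over Sampling at `target` recall:
    (N - k) / N - (1 - target), where k is the smallest rank
    at which recall first reaches `target`."""
    n, found = len(ranking), 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
        if found / len(relevant) >= target:
            return (n - i) / n - (1.0 - target)
    return 0.0
```

Under this definition a WSS95 of, say, 0.2 means the reviewer could stop 20 percentage points earlier than random screening would require to reach 95% recall.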
			<div type="references">

				<listBibl>


<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Retrieving and ranking studies for systematic reviews: University of Sheffield&apos;s approach to CLEF eHealth 2018 Task 2</title>
		<author>
			<persName><forename type="first">A</forename><surname>Alharbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Briggs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stevenson</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2018 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting><address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">September 10-14, 2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Reducing workload in systematic review preparation using automated citation classification</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">R</forename><surname>Hersh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Peterson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">Y</forename><surname>Yen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Medical Informatics Association</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="206" to="219" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">OHSU CLEF 2018 Task 2 diagnostic test accuracy ranking using publication type cluster similarity measures</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">R</forename><surname>Smalheiser</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2018 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting><address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">September 10-14, 2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Engineering quality and reliability in technologyassisted review</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Grossman</surname></persName>
		</author>
		<idno type="DOI">10.1145/2911451.2911510</idno>
		<ptr target="http://doi.acm.org/10.1145/2911451.2911510" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<meeting>the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="75" to="84" />
		</imprint>
	</monogr>
	<note>SIGIR &apos;16</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Technology-assisted review in empirical medicine: Waterloo participation in CLEF eHealth</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Grossman</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2018 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting><address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">September 10-14, 2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">CLEF 2017 ehealth evaluation lab overview</title>
		<author>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Suominen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Névéol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Robert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Spijker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R M</forename><surname>Palotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zuccon</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-65813-1_26</idno>
		<ptr target="https://doi.org/10.1007/978-3-319-65813-1_26" />
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction -8th International Conference of the CLEF Association, CLEF 2017</title>
		<title level="s">Proceedings. Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">G</forename><forename type="middle">J F</forename><surname>Jones</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Lawless</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Kelly</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<meeting><address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">September 11-14, 2017. 2017</date>
			<biblScope unit="volume">10456</biblScope>
			<biblScope unit="page" from="291" to="303" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">TREC 2016 total recall track overview</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Grossman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roegiest</surname></persName>
		</author>
		<ptr target="http://trec.nist.gov/pubs/trec25/papers/Overview-TR.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of The Twenty-Fifth Text REtrieval Conference, TREC 2016</title>
				<editor>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Voorhees</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Ellis</surname></persName>
		</editor>
		<meeting>The Twenty-Fifth Text REtrieval Conference, TREC 2016<address><addrLine>Gaithersburg, Maryland, USA, November</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">15-18, 2016. 2016</date>
			<biblScope unit="volume">500-321</biblScope>
		</imprint>
		<respStmt>
			<orgName>National Institute of Standards and Technology (NIST)</orgName>
		</respStmt>
	</monogr>
	<note>Special Publication</note>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Handbook for DTA reviews</title>
		<author>
			<orgName>Cochrane DTA Working Group</orgName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Higgins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Green</surname></persName>
		</author>
		<title level="m">Cochrane handbook for systematic reviews of interventions</title>
				<imprint>
			<publisher>John Wiley &amp; Sons</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="volume">4</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">CLEF 2017 technologically assisted reviews in empirical medicine overview</title>
		<author>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Azzopardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Spijker</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-1866/invited_paper_12.pdf" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2017 -Conference and Labs of the Evaluation Forum</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<meeting><address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">September 11-14, 2017. 2017</date>
			<biblScope unit="volume">1866</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Cochrane diagnostic test accuracy reviews</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Leeflang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Deeks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Takwoingi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Macaskill</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Systematic reviews</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">82</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Aristotle University&apos;s approach to the Technologically Assisted Reviews in Empirical Medicine task of the 2018 CLEF eHealth lab</title>
		<author>
			<persName><forename type="first">A</forename><surname>Minas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lagopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Tsoumakas</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2018 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting><address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">September 10-14, 2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">LIMSI@CLEF eHealth 2018 Task 2: Technology assisted reviews by stacking active and static learning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Norman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Leeflang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neveol</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2018 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting><address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">September 10-14, 2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Interactive sampling for systematic reviews. IMS Unipd at CLEF 2018 eHealth Task 2</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">M D</forename><surname>Nunzio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ciuffreda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Vezzani</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2018 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting><address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">September 10-14, 2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Using text mining for study identification in systematic reviews: a systematic review of current approaches</title>
		<author>
			<persName><forename type="first">A</forename><surname>O&apos;Mara-Eves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Thomas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>McNaught</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Miwa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ananiadou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Systematic Reviews</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">5</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Automatic keyword extraction from individual documents</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Engel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Cramer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Cowley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Text Mining: Applications and Theory</title>
				<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="1" to="20" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">A test collection for evaluating retrieval of studies for inclusion in systematic reviews</title>
		<author>
			<persName><forename type="first">H</forename><surname>Scells</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zuccon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Koopman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Deacon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Geva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Azzopardi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th international ACM SIGIR conference on Research and development in Information Retrieval</title>
				<meeting>the 40th international ACM SIGIR conference on Research and development in Information Retrieval</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">ECNU at 2018 eHealth Task 2: Technologically assisted reviews in empirical medicine</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>He</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2018 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting><address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">September 10-14, 2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
