<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Data Balancing for Technologically Assisted Reviews: Undersampling or Reweighting</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Zhe</forename><surname>Yu</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">NC State University</orgName>
								<address>
									<postCode>27695</postCode>
									<settlement>Raleigh</settlement>
									<region>NC</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Tim</forename><surname>Menzies</surname></persName>
							<email>tim.menzies@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">NC State University</orgName>
								<address>
									<postCode>27695</postCode>
									<settlement>Raleigh</settlement>
									<region>NC</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Data Balancing for Technologically Assisted Reviews: Undersampling or Reweighting</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">1C76A84E910E12962FFF8649039C8727</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:31+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>technologically assisted reviews</term>
					<term>active learning</term>
					<term>data balancing</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents approaches for the automated support of citation screening in systematic reviews. Continuous active learning is chosen as our baseline approach, on top of which two data balancing techniques are applied to handle the class imbalance problem. These two techniques, aggressive undersampling and reweighting, are tested and compared on 20 data sets for Diagnostic Test Accuracy (DTA) reviews. Results are evaluated by last rel and suggest that reweighting outperforms undersampling: it not only balances the training data, but also emphasizes the "content relevant" examples over "abstract relevant" ones and thus helps to retrieve "content relevant" papers earlier.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>This paper is a participant working note for the task of technologically assisted reviews in empirical medicine <ref type="bibr" target="#b6">[7]</ref> in CLEF eHealth 2017 <ref type="bibr" target="#b5">[6]</ref>. This task is about applying machine learning techniques to help medical researchers conduct systematic reviews. More specifically, the task focuses on Diagnostic Test Accuracy (DTA) reviews since search in this area is generally considered the hardest, and a breakthrough in this field would likely be applicable to other areas as well <ref type="bibr" target="#b6">[7]</ref>. Twenty DTA reviews data sets are provided for training and thirty for testing. The problem statement of this task is:</p><p>Given the results of a Boolean Search, how to make Abstract and Title Screening more effective.</p><p>Here, in this paper, we further specify our problem to be:</p><p>Screen the fewest papers to retrieve most (or all) relevant ones.</p><p>This leads directly to the evaluation metric last rel <ref type="bibr" target="#b6">[7]</ref>, which measures the number of documents that need to be screened before all relevant documents are retrieved.</p><p>Previously, we analyzed the equivalent problem in software engineering (SE) and built a high-performing method, FASTREAD, that combines a wide range of techniques taken from electronic discovery and evidence-based medicine <ref type="bibr" target="#b11">[12]</ref>. Those results suggested that FASTREAD, which took aggressive undersampling from patient active learning <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b9">10]</ref> and the rest from continuous active learning <ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref>, outperforms both of the original algorithms on SE reviews data <ref type="bibr" target="#b11">[12]</ref>. 
It indicated that, at least on SE reviews data, continuous active learning is an efficient approach, and data balancing can further improve its performance.</p><p>While the above results are promising, we advise against applying the conclusions directly to the empirical medicine task since the target corpora are very different (one from SE reviews and one from DTA reviews). In addition, the DTA reviews data have two levels of query results, one from title and abstract screening and the other from document screening, while the SE reviews data <ref type="bibr" target="#b11">[12]</ref> only have the query results from document screening. We therefore feel that, when properly exploited, these two-level labels make reweighting another way to balance the training data: more weight is put on papers identified as "content relevant" than on those identified as only "abstract relevant" or "not relevant". In this way, reweighting not only balances the two classes, but also favors "content relevant" examples when training the model.</p><p>Besides the two-level query results, the DTA reviews data also offer a brief description of the topic being screened, which could be a great source for "Auto-Syn" described in <ref type="bibr" target="#b12">[13]</ref> and <ref type="bibr" target="#b2">[3]</ref>. Utilizing the description as an initial seed training example provides a better chance to retrieve "relevant" papers earlier and reduces variance in the experiments (compared to a random start-up). Note that in order to train a classifier on just one "relevant" example (the description of the topic), presumptive non-relevant examples are generated <ref type="bibr" target="#b2">[3]</ref>. This technique randomly samples from the unlabeled examples and treats the sampled examples as "not relevant" in training. The low prevalence of "relevant" examples makes this technique reasonable.</p><p>The rest of the paper provides details about the different approaches tested on the training data and analyzes the results. 
Due to limited time, numerous engineering decisions were made without being fully tested. Conclusions and future work are presented at the end.</p></div>
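As a concrete illustration of the presumptive non-relevant technique described above, here is a minimal Python sketch. The function name, signature, and sample size are illustrative assumptions, not taken from the cited work:

```python
import random

def presumptive_non_relevant(unlabeled_ids, n_samples, seed=0):
    """Randomly sample unlabeled papers and presume them "not relevant".

    Because "relevant" papers have low prevalence, the sampled papers
    are very likely true negatives, so they can serve as negative
    training examples before any real screening has happened.
    """
    rng = random.Random(seed)
    pool = list(unlabeled_ids)
    n = min(n_samples, len(pool))
    return [(pid, "not relevant") for pid in rng.sample(pool, n)]
```

The presumed labels are used only for the current training round; once a paper is actually screened, its true label replaces the presumed one.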
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Method</head><p>In this section, we provide details on three approaches:</p><p>-CAL: the baseline approach from Cormack et al. <ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref>.</p><p>-AU: the baseline approach CAL plus the data balancing method called aggressive undersampling <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b9">10]</ref>.</p><p>-RW: the baseline approach CAL plus a reweighting method ("content relevant" papers weigh more than other papers in training).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Baseline: CAL</head><p>Besides the overall framework of continuous active learning <ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref>, the baseline approach applies several predefined engineering decisions, the same as in our previous work <ref type="bibr" target="#b11">[12]</ref>. The entire workflow can be described as follows:</p><p>1. Corpus collection: collect titles and abstracts of papers in the search results. 2. Auto-Syn: add the topic description into the corpus and label it as "abstract relevant". 3. Preprocessing: stemming, stop word removal, bag of words. 4. Featurization: term frequency, feature selection by tf-idf score (top 4000 terms), l2 normalization. 5. Training: train a binary classifier (linear SVM) on all the labeled papers; "content relevant" and "abstract relevant" papers are treated as one class ("relevant") while "not relevant" papers form the other class. Presumptive non-relevant examples are generated to enrich the "not relevant" class. 6. Certainty sampling: use the trained classifier to predict on the remaining unlabeled papers. Sample the N = 10 papers with the highest probability of being "relevant" according to the classifier. 7. Review<ref type="foot" target="#foot_0">1</ref>: ask reviewers to review the sampled papers by titles and abstracts, labeling each as "abstract relevant" or "not relevant". For papers labeled as "abstract relevant", reviewers are asked to further review the content and decide whether to label each as "content relevant". Go back to step 5 until the stop rule is satisfied (every "content relevant" paper has been retrieved). A threshold of M = 30 is applied to avoid training an SVM model on too few papers <ref type="bibr" target="#b11">[12]</ref>.</p></div>
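One certainty-sampling iteration of the loop above (steps 3-6) can be sketched in Python with scikit-learn. The parameter values follow the text (4000 terms, l2 normalization, N = 10); the helper name and the use of scikit-learn are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def certainty_sample(abstracts, labeled_idx, labels, n_query=10):
    """Train a linear SVM on the labeled papers and return the indices
    of the n_query unlabeled papers most likely to be "relevant"."""
    # Bag of words -> tf-idf features, top 4000 terms, l2-normalized
    tfidf = TfidfVectorizer(stop_words="english", max_features=4000, norm="l2")
    X = tfidf.fit_transform(abstracts)
    clf = LinearSVC()
    clf.fit(X[labeled_idx], labels)  # labels: 1 = relevant, 0 = not relevant
    unlabeled = np.setdiff1d(np.arange(len(abstracts)), labeled_idx)
    scores = clf.decision_function(X[unlabeled])
    # Certainty sampling: highest decision value = most confidently "relevant"
    return unlabeled[np.argsort(scores)[::-1][:n_query]]
```

The returned indices are the papers the simulated reviewer labels next, after which the classifier is retrained on the enlarged labeled set.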
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Aggressive Undersampling: AU</head><p>Aggressive undersampling (AU) adds the data balancing method from patient active learning <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b9">10]</ref> to the baseline approach CAL.</p></div>
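A minimal sketch of aggressive undersampling as commonly formulated in the cited work: after training an initial SVM, discard the "not relevant" training examples closest to the decision hyperplane until the two classes are balanced, then retrain. The helper name and the use of scikit-learn are illustrative assumptions; the exact variant used in this study may differ:

```python
import numpy as np
from sklearn.svm import LinearSVC

def aggressive_undersample(X, y):
    """Return indices of a balanced training subset of (X, y).

    Keeps all positives (y == 1) and only the negatives farthest
    from the initial SVM's hyperplane, i.e. the most confidently
    "not relevant" examples.
    """
    clf = LinearSVC().fit(X, y)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    dist = np.abs(clf.decision_function(X[neg]))
    keep_neg = neg[np.argsort(dist)[::-1][: len(pos)]]
    return np.concatenate([pos, keep_neg])
```

Discarding the ambiguous, near-boundary negatives pushes the retrained decision boundary toward the negative side, which tends to increase recall of the rare "relevant" class.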
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Reweighting: RW</head><p>Reweighting (RW) is a new approach that takes advantage of the two-level labels offered by the DTA reviews data. The only difference between RW and the baseline approach CAL is in the training step:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>5.</head><p>Training: train a binary classifier (linear SVM) on all the labeled papers; "content relevant" and "abstract relevant" papers are treated as one class ("relevant"), but "content relevant" papers have W = 10 times the weight of "abstract relevant" or "not relevant" ones. Presumptive non-relevant examples are generated to enrich the "not relevant" class.</p><p>The reweighting parameter W = 10 was chosen somewhat arbitrarily and has not been fully tested due to the limited time.</p></div>
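The reweighted training step can be sketched with scikit-learn's `sample_weight` mechanism. W = 10 follows the text; the helper name and the three-level label encoding are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_reweighted(X, levels, W=10):
    """Train the RW classifier.

    levels: per-paper label, one of "content", "abstract", "none".
    "content" and "abstract" papers form the positive ("relevant")
    class, but "content relevant" papers get W times the weight.
    """
    y = np.array([1 if lv in ("content", "abstract") else 0 for lv in levels])
    weights = np.array([W if lv == "content" else 1 for lv in levels])
    clf = LinearSVC()
    clf.fit(X, y, sample_weight=weights)
    return clf
```

Because the decision boundary is pulled toward the heavily weighted "content relevant" examples, papers resembling them receive higher decision scores and surface earlier in certainty sampling.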
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experiment</head><p>Experiments are conducted in a "pseudo" way following the procedures in Section 2. When a paper is selected for review, its true label is queried instead of involving any real human review process. As a result, the experiments are repeatable and reproducible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data</head><p>Twenty data sets on DTA reviews are provided as training sets for the task of technologically assisted reviews in empirical medicine <ref type="bibr" target="#b6">[7]</ref>. These data sets provide two-level query results, one for title and abstract screening and one for content screening. As a result, we label each paper in the data sets as one of three classes:</p><p>-Not relevant: papers excluded by title and abstract screening. -Abstract relevant: papers included by title and abstract screening but excluded by content screening. -Content relevant: papers included by both title and abstract screening and content screening.</p><p>Statistics for the twenty data sets are presented in Table <ref type="table" target="#tab_1">1</ref>. The "Content" column displays the number of "content relevant" papers; the "Abstract" column displays the number of "content relevant" papers plus the number of "abstract relevant" papers; the "Total" column displays the total number of papers. Topics 1, 6, 19, 28, and 45 (colored in red ) are considered "not good" for last rel evaluation due to their lack of "content relevant" papers (fewer than 5). The reason is that pure "luck" might affect the result when the target is to retrieve the only 1 (or 2, or 3) "content relevant" papers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Performance Metrics</head><p>Since the objective is to screen the fewest papers to retrieve most (or all) relevant ones, we choose last rel for evaluation. More specifically, we use the number of papers screened when every "content relevant" one has been retrieved as the performance score, taking advantage of the two-level labels offered by the DTA reviews data. This makes our last rel metric different from that used in <ref type="bibr" target="#b6">[7]</ref>.</p><p>The lower the last rel score, the fewer papers need to be manually screened, and thus the better the performance. To capture the possible variance, the experiment for each method on every data set (topic) is repeated 10 times with different random seeds (which affect the presumptive non-relevant examples generated and thus introduce variance). The last rel score for each repeat is collected, and medians and iqrs (75th-25th percentile) are calculated for comparison. Scott-Knott <ref type="bibr" target="#b8">[9]</ref> analyses are applied on each topic to rank the performance of each treatment. Since the last rel scores follow asymmetric and non-normal distributions, Cliff's Delta <ref type="bibr" target="#b0">[1]</ref> and bootstrapping <ref type="bibr" target="#b4">[5]</ref> are applied as non-parametric hypothesis tests; i.e., two treatments are ranked differently in the Scott-Knott analysis only if both bootstrapping and the effect size test agree that the division is statistically significant (99% confidence) and not a small effect (Cliff's Delta ≥ 0.147).</p></div>
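For reference, Cliff's Delta can be computed directly from the two samples being compared; |delta| &lt; 0.147 is the conventional "small effect" cutoff used above. A minimal sketch (illustrative, not the authors' implementation):

```python
def cliffs_delta(xs, ys):
    """Cliff's Delta in [-1, 1]: the fraction of (x, y) pairs where
    x > y minus the fraction where x < y. 0 means the two samples
    overlap completely; +/-1 means complete separation."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```

Because the statistic only compares orderings of values, it makes no normality assumption, which suits the skewed last rel distributions.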
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Results</head><p>Table <ref type="table" target="#tab_2">2</ref> shows the results on the 20 topics from the training set. The first thing we notice is that no treatment ranks highest (colored in green ) across every topic. One treatment may outperform the others on one topic but perform poorly on another. In addition, no domination can be found among the three treatments (we say treatment A dominates treatment B if A performs consistently better than B across all topics).</p><p>Therefore, when it comes to the question of which treatment is the best, it really depends on the data. However, we did summarize the results in Table <ref type="table" target="#tab_2">2</ref> and count the number of "wins" and "losses" of each treatment. As shown in Table <ref type="table" target="#tab_3">3</ref>, statistically, reweighting (RW) wins more often and loses less often than any other treatment. As a result, among these three treatments, we recommend reweighting (RW), which over-weights the "content relevant" examples to balance the training data as well as to emphasize "content relevant" examples. Another gain from these experiments is that data balancing techniques do improve performance. As indicated in Table <ref type="table" target="#tab_2">2</ref>, on 19 out of 20 (or 14 out of 15) topics, reweighting (RW) or aggressive undersampling (AU) ranks highest; on 13 out of 20 (or 9 out of 15) topics, RW or AU ranks higher than continuous active learning (CAL) without data balancing. This also suggests that an ensemble of RW and AU, leveraging the advantages of both data balancing techniques, might offer even better results. We plan to explore this in future work.</p><p>Variances are within an acceptably low range (except for some of the "not good" topics) thanks to the "Auto-Syn" technique. Therefore the results are considered to be stable and repeatable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>How to retrieve most (or all) relevant documents by screening the fewest candidate ones is a difficult problem, also known in the Information Retrieval (IR) domain as the total recall problem. Proposed by Cormack et al. in 2014, continuous active learning has been an excellent algorithm for solving this problem <ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref>. It was also adopted as a baseline method in the total recall track of TREC 2015 <ref type="bibr" target="#b7">[8]</ref>. This work extended the continuous active learning method by testing two different data balancing techniques. Experimental results suggested that no single treatment outperforms all others across all topics. However, statistically, reweighting (RW) was considered the most powerful treatment for the total recall task. This treatment applied "Auto-Syn" with the topic description as seed training data, generated "presumptive non-relevant examples" before training to enrich the "not relevant" class, and over-weighted the "content relevant" examples for data balancing. With the reweighting treatment, the training examples were balanced (so the model does not over-fit on the "not relevant" class), and the model was trained to "favor" the "content relevant" examples, which had a positive effect on retrieving every "content relevant" paper earlier.</p><p>Due to the limited time, only one aspect (data balancing) has been explored in this study. This does not imply that other aspects of the total recall task are not worth exploring. Plans for future work include:</p><p>-Explore the ensemble of reweighting and aggressive undersampling and other possible data balancing techniques. -Many parameters in the tested treatments were chosen quite arbitrarily. 
Parameter tuning can be applied to see whether these parameters affect the conclusions and whether a better set of parameters can be found.</p><p>-Different featurization techniques can be applied to extract "richer" features than bag-of-words or term frequencies; e.g. word vectors and citation link features might be useful for measuring relevance. -Human errors can be injected to test how robust the active learning methods are and at what level of error rate the system can still perform normally.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>Descriptive statistics for experimental data sets.</figDesc><table><row><cell></cell><cell>Content</cell><cell>Abstract</cell><cell>Total</cell><cell></cell><cell>Content</cell><cell>Abstract</cell><cell>Total</cell></row><row><cell>Topic1</cell><cell>2</cell><cell>30</cell><cell>3241</cell><cell>Topic35</cell><cell>9</cell><cell>98</cell><cell>3857</cell></row><row><cell>Topic4</cell><cell>28</cell><cell>442</cell><cell>8180</cell><cell>Topic37</cell><cell>12</cell><cell>154</cell><cell>1576</cell></row><row><cell>Topic6</cell><cell>2</cell><cell>6</cell><cell>15078</cell><cell>Topic38</cell><cell>5</cell><cell>109</cell><cell>12704</cell></row><row><cell>Topic9</cell><cell>60</cell><cell>98</cell><cell>1162</cell><cell>Topic43</cell><cell>27</cell><cell>48</cell><cell>43335</cell></row><row><cell>Topic11</cell><cell>8</cell><cell>59</cell><cell>1457</cell><cell>Topic44</cell><cell>30</cell><cell>206</cell><cell>3149</cell></row><row><cell>Topic14</cell><cell>20</cell><cell>63</cell><cell>14907</cell><cell>Topic45</cell><cell>1</cell><cell>42</cell><cell>316</cell></row><row><cell>Topic19</cell><cell>1</cell><cell>1</cell><cell>12704</cell><cell>Topic50</cell><cell>41</cell><cell>143</cell><cell>7990</cell></row><row><cell>Topic23</cell><cell>48</cell><cell>200</cell><cell>1938</cell><cell>Topic53</cell><cell>19</cell><cell>67</cell><cell>1310</cell></row><row><cell>Topic28</cell><cell>3</cell><cell>5</cell><cell>3964</cell><cell>Topic54</cell><cell>14</cell><cell>27</cell><cell>1499</cell></row><row><cell>Topic33</cell><cell>60</cell><cell>604</cell><cell>8186</cell><cell>Topic55</cell><cell>45</cell><cell>92</cell><cell>2542</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 .</head><label>2</label><figDesc>Experimental results, collected from 10 repeated runs on 20 topics. For both medians and iqrs, lower is better. For each topic, aggressive undersampling (AU) and reweighting (RW) are compared with the baseline method, continuous active learning without data balancing (CAL). Scott-Knott analyses (with Cliff's Delta and bootstrapping as non-parametric hypothesis tests) are applied to rank each treatment. The treatments with the highest rank are colored in green while the treatments with lower ranks than the baseline (CAL) are colored in gray .</figDesc><table><row><cell></cell><cell cols="3">MEDIAN</cell><cell cols="3">IQR</cell></row><row><cell></cell><cell>RW</cell><cell>AU</cell><cell>CAL</cell><cell>RW</cell><cell>AU</cell><cell>CAL</cell></row><row><cell>Topic1</cell><cell>510</cell><cell>890</cell><cell>885</cell><cell>205</cell><cell>7</cell><cell>10</cell></row><row><cell>Topic4</cell><cell>260</cell><cell>410</cell><cell>385</cell><cell>70</cell><cell>62</cell><cell>122</cell></row><row><cell>Topic6</cell><cell>5475</cell><cell>12270</cell><cell>6055</cell><cell>327</cell><cell>7</cell><cell>440</cell></row><row><cell>Topic9</cell><cell>690</cell><cell>750</cell><cell>690</cell><cell>0</cell><cell>0</cell><cell>0</cell></row><row><cell>Topic11</cell><cell>75</cell><cell>90</cell><cell>80</cell><cell>17</cell><cell>10</cell><cell>7</cell></row><row><cell>Topic14</cell><cell>110</cell><cell>115</cell><cell>110</cell><cell>10</cell><cell>25</cell><cell>17</cell></row><row><cell>Topic19</cell><cell>8320</cell><cell>6160</cell><cell>8320</cell><cell>0</cell><cell>0</cell><cell>7</cell></row><row><cell>Topic23</cell><cell>920</cell><cell>840</cell><cell>1040</cell><cell>0</cell><cell>27</cell><cell>0</cell></row><row><cell>Topic28</cell><cell>1715</cell><cell>1525</cell><cell>1600</cell><cell>17</cell><cell>27</cell><cell>17</cell></row><row><cell>Topic33</cell><cell>4360</cell><cell>3780</cell><cell>4970</cell><cell>0</cell><cell>62</cell><cell>0</cell></row><row><cell>Topic35</cell><cell>210</cell><cell>260</cell><cell>405</cell><cell>27</cell><cell>10</cell><cell>115</cell></row><row><cell>Topic37</cell><cell>310</cell><cell>380</cell><cell>475</cell><cell>20</cell><cell>27</cell><cell>35</cell></row><row><cell>Topic38</cell><cell>490</cell><cell>960</cell><cell>980</cell><cell>15</cell><cell>447</cell><cell>97</cell></row><row><cell>Topic43</cell><cell>180</cell><cell>1140</cell><cell>230</cell><cell>37</cell><cell>210</cell><cell>17</cell></row><row><cell>Topic44</cell><cell>670</cell><cell>510</cell><cell>945</cell><cell>92</cell><cell>37</cell><cell>50</cell></row><row><cell>Topic45</cell><cell>20</cell><cell>10</cell><cell>10</cell><cell>15</cell><cell>7</cell><cell>7</cell></row><row><cell>Topic50</cell><cell>425</cell><cell>445</cell><cell>535</cell><cell>65</cell><cell>35</cell><cell>105</cell></row><row><cell>Topic53</cell><cell>340</cell><cell>620</cell><cell>280</cell><cell>0</cell><cell>60</cell><cell>0</cell></row><row><cell>Topic54</cell><cell>510</cell><cell>440</cell><cell>440</cell><cell>10</cell><cell>0</cell><cell>0</cell></row><row><cell>Topic55</cell><cell>740</cell><cell>850</cell><cell>610</cell><cell>0</cell><cell>17</cell><cell>0</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 .</head><label>3</label><figDesc>Summary of the experimental results. The "Top Rank" column displays the number of times a treatment ranks highest, while the "Lower Rank than Baseline" column displays the number of times a treatment ranks lower than the baseline treatment (CAL). The first two columns count all 20 topics while the last two columns only count "good" topics (excluding topics colored in red in Tables 1 and 2). One treatment is considered better than another if the number in "Top Rank" is larger while the number in "Lower Rank than Baseline" is smaller.</figDesc><table><row><cell></cell><cell cols="2">In all 20 topics</cell><cell cols="2">In 15 "good" topics</cell></row><row><cell></cell><cell>Top Rank</cell><cell>Lower Rank than Baseline</cell><cell>Top Rank</cell><cell>Lower Rank than Baseline</cell></row><row><cell>RW</cell><cell>14</cell><cell>3</cell><cell>11</cell><cell>2</cell></row><row><cell>AU</cell><cell>9</cell><cell>6</cell><cell>6</cell><cell>5</cell></row><row><cell>CAL</cell><cell>7</cell><cell>NA</cell><cell>6</cell><cell>NA</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The actual experiments are carried out without real human reviewers. When asked for labels, the true labels in the data sets are queried instead of a human reviewer.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Dominance statistics: Ordinal analyses to answer ordinal questions</title>
		<author>
			<persName><forename type="first">N</forename><surname>Cliff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Psychological Bulletin</title>
		<imprint>
			<biblScope unit="volume">114</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page">494</biblScope>
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Evaluation of machine-learning protocols for technology-assisted review in electronic discovery</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Grossman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 37th international ACM SIGIR conference on Research &amp; development in information retrieval</title>
				<meeting>the 37th international ACM SIGIR conference on Research &amp; development in information retrieval</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="153" to="162" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Autonomy and reliability of continuous active learning for technology-assisted review</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Grossman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1504.06868</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Scalability of continuous active learning for reliable high-recall text classification</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Grossman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th ACM International on Conference on Information and Knowledge Management</title>
				<meeting>the 25th ACM International on Conference on Information and Knowledge Management</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1039" to="1048" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">An introduction to the bootstrap</title>
		<author>
			<persName><forename type="first">B</forename><surname>Efron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Tibshirani</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1994">1994</date>
			<publisher>CRC press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Clef 2017 ehealth evaluation lab overview</title>
		<author>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Suominen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Névéol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Robert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Spijker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R M</forename><surname>Palotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zuccon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction -8th International Conference of the CLEF Association, CLEF 2017</title>
		<title level="s">Proceedings. Lecture Notes in Computer Science</title>
		<meeting><address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">September 11-14, 2017. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Overview of the CLEF technologically assisted reviews in empirical medicine</title>
		<author>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Azzopardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Spijker</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2017 -Conference and Labs of the Evaluation forum</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting><address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">September 11-14, 2017. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Trec 2015 total recall track overview</title>
		<author>
			<persName><forename type="first">A</forename><surname>Roegiest</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Grossman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Clarke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. TREC-2015</title>
		<meeting>TREC-2015</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A cluster analysis method for grouping means in the analysis of variance</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Scott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Knott</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biometrics</title>
		<imprint>
			<biblScope unit="page" from="507" to="512" />
			<date type="published" when="1974">1974</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Active learning for biomedical citation screening</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">C</forename><surname>Wallace</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Small</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">E</forename><surname>Brodley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">A</forename><surname>Trikalinos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining</title>
		<meeting>the 16th ACM SIGKDD international conference on Knowledge discovery and data mining</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="173" to="182" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Semiautomated screening of biomedical citations for systematic reviews</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">C</forename><surname>Wallace</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">A</forename><surname>Trikalinos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Brodley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>Schmid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC bioinformatics</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">1</biblScope>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">How to read less: Better machine assisted reading methods for systematic literature reviews</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Kraft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Menzies</surname></persName>
		</author>
		<idno>CoRR abs/1612.03224</idno>
		<ptr target="http://arxiv.org/abs/1612.03224" />
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">WaterlooClarke: TREC 2015 total recall track</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Smucker</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>TREC</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
