<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Retrieval and Richness when Querying by Document</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eugene Yang</string-name>
          <email>eugene@ir.cs.georgetown.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ophir Frieder</string-name>
          <email>ophir@ir.cs.georgetown.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Grossman</string-name>
          <email>grossman@ir.cs.georgetown.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David D. Lewis</string-name>
          <email>desires2018paper@davelewis.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman Yurchak</string-name>
          <email>roman.yurchak@symerio.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cyxtera Technologies</institution>
          ,
          <addr-line>Dallas, TX</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Georgetown University</institution>
          ,
          <addr-line>Washington, DC</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Symerio SAS</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>A single relevant document can be viewed as a long query for ad hoc retrieval, or a tiny training set for supervised learning. We tested techniques for QBD (query by document), with an eye toward their eventual use in active learning of text classifiers in a legal context. Richness (prevalence of relevant documents) varies widely in our tasks of interest. We used 658 categories from the RCV1-v2 collection to study the impact of richness on QBD variants supported by Elasticsearch. BM25 weighting on full query documents dominated other methods. However, its absolute and relative effectiveness depended strongly on richness, raising broader questions about common test collection practices. We ported Elasticsearch's version of BM25 to the machine learning package scikit-learn, and we discuss some lessons learned about the replicability of retrieval results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Having a relevant item in hand, and desiring to find others, is a
common information access task. A Query By Document (QBD)
functionality, sometimes referred to as More Like This, Related
Documents, Similar Documents, or Recommendations is common
in both standalone search software and as search functionality in
other applications [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>
        Our interest in QBD, however, comes from another direction. In
legal applications such as electronic discovery and corporate
investigations, active learning [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] is used for both supervised learning of
text classifiers and for machine learning-supported interactive
annotation of datasets (finite population annotation or FPA) [
        <xref ref-type="bibr" rid="ref48 ref5 ref6 ref9">5, 6, 9, 48</xref>
        ].
Iterative relevance feedback (training on top-ranked documents)
[
        <xref ref-type="bibr" rid="ref40">40</xref>
        ] is the most widely used version of active learning. This holds
particularly for FPA in the law, where iterative relevance feedback
is sometimes known as Continuous Active Learning or CAL [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] (Grossman and Cormack have filed trademark applications for these terms).
      </p>
      <p>
        Query By Example (QBE) capabilities have been explored for a
range of data types, including text (see below), database records
[
        <xref ref-type="bibr" rid="ref52">52</xref>
        ], voice [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], music [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], images [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], and video [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. We use
the term Query By Document (QBD) in discussing QBE where the
query is an entire document [
        <xref ref-type="bibr" rid="ref49">49</xref>
        ].
      </p>
      <p>
        Early work on QBD for text treated it as relevance feedback
[
        <xref ref-type="bibr" rid="ref1 ref46">1, 46</xref>
        ]. QBD is more difficult, however, since relevance feedback
has available both a query and at least one sample document [
        <xref ref-type="bibr" rid="ref49">49</xref>
        ].
The query terms provide both additional content and a form of
regularization, which is particularly critical with small training sets
[
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. QBD is likewise more difficult than ad hoc retrieval. Even
when consisting of user-selected terms, verbose queries typically
provide poorer effectiveness than shorter ones [
        <xref ref-type="bibr" rid="ref16 ref7">7, 16</xref>
        ]. QBD involves
not just verbose queries, but ones that have not benefited from user
term selection.
      </p>
      <p>
        We distinguish QBD from near-duplicate detection [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
plagiarism detection [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], and related tasks. While some of the same
techniques are used, in QBD the goal is retrieval of documents
with related meaning, not just documents that have an edit-based
historical connection.
      </p>
      <p>
        The largest body of QBD research is in evaluation campaigns for
the patent domain. Both patent applications and patent documents
are used as queries to search patents, technical articles, and portions
thereof [
        <xref ref-type="bibr" rid="ref13 ref31 ref34 ref44">13, 31, 34, 44</xref>
        ]. This literature almost uniformly assumes
that query reformulation (particularly query reduction by dropping
most terms) is necessary [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ]. The same assumption is found in
most non-patent QBD work [
        <xref ref-type="bibr" rid="ref49">49</xref>
        ].
      </p>
      <p>
        Only a few studies used full documents, or at least sections, as
queries [
        <xref ref-type="bibr" rid="ref14 ref47">14, 47</xref>
        ]. Of these, only one, to our knowledge, compared
basic retrieval models, finding BM25 dominated other methods [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
Query reduction has obvious benefits for efficiency in querying an
inverted file retrieval system. However, no such efficiency benefit
exists with typical machine learning software architectures,
motivating us to look at full document querying more systematically.
      </p>
    </sec>
    <sec id="sec-2">
      <title>RICHNESS, RETRIEVAL, AND QBD</title>
      <p>Past research on ad hoc retrieval, including QBD, has in two ways
assumed a narrow range of richness values.</p>
      <p>
        First, many ad hoc retrieval algorithms make modeling
assumptions that imply low richness. Probabilistic retrieval methods,
including BM25, derive inverse document frequency (IDF) weights
from the assumption that the collection contains no relevant
documents [
        <xref ref-type="bibr" rid="ref11 ref35 ref36">11, 35, 36</xref>
        ]. Many language modeling retrieval approaches
treat a query as generated by a single draw from a mixture
distribution over documents [
        <xref ref-type="bibr" rid="ref51">51</xref>
        ]. This is equivalent to assuming that there
is a single relevant document in the collection.
      </p>
      <p>
        Second, a narrow range of richness values is usually imposed
in test collection construction. Queries with too low or too high
richness are typically discarded, and no more than a few thousand
documents are assessed for queries [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Richness of documents
coded relevant is thus forced to fall in a narrow range, while the
actual richness with respect to the simulated user need typically
remains unknown. The patent collections used in QBD studies have
a similar problem, given their use of patent citations (which are
deliberately bounded and incomplete) as ground truth.
      </p>
      <p>
        The one exception is test collections produced for research on
electronic discovery in the law. Some of these collections have
purportedly complete relevance judgments [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] or stratified samples
that allow rough estimates of richness [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Further, some topics
have relatively high richness. The number of such topics in these
collections is quite small, however, and no studies of the impact of
richness in these collections have been made.
      </p>
      <p>
        The situation is very diferent in supervised learning. Commonly
available test data sets for machine learning vary widely in class
imbalance (richness in the binary case), and the impact of class
imbalance on classification has been the subject of much research
[
        <xref ref-type="bibr" rid="ref20 ref24">20, 24</xref>
        ]. Class imbalance affects common supervised
learning algorithms despite the fact that most of these methods
treat the two classes (in a binary problem) symmetrically. One might
expect the degree of class imbalance, i.e., richness, to have an even
stronger effect on ad hoc retrieval methods, since these methods do
make assumptions about richness and asymmetry.
      </p>
      <p>
        The possible importance of richness was anticipated in some
very early work on ad hoc retrieval. Salton studied the impact of
generality (what we call richness here) on precision, finding that
precision decreased with generality [
        <xref ref-type="bibr" rid="ref41 ref42">41, 42</xref>
        ]. These results,
however, were based on either (a) comparing different collections of
documents, or (b) altering the definition of relevance (e.g., the number
of agreeing assessors) on a single collection. Both these approaches
introduce conflating factors that make interpreting Salton’s results
unclear. Further, the collections used were tiny by modern
standards (at most 1400 documents), and richness variations were not
large (at most a factor of 7). Lewis and Tong drew on Salton’s results
in studying the impact of text classification components on
information extraction systems, but did not examine ad hoc retrieval
[
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
      <p>
        As a terminological matter, Robertson defined "generality" to
have the same meaning we give "richness" here [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]. The term
generality, however, has been used ambiguously in the information
retrieval literature and is more commonly used to refer
to breadth of meaning. We therefore use the term richness, which
has emerged in e-discovery.
      </p>
    </sec>
    <sec id="sec-3">
      <title>METHODS</title>
      <p>Our interests in QBD, in the impact of richness, and in adapting ad
hoc retrieval methods for supervised learning were reflected in our
methodological choices.</p>
    </sec>
    <sec id="sec-4">
      <title>Dataset</title>
      <p>
        Our experiments used the RCV1-v2 text categorization test
collection [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. The collection contains 804,414 documents that have been
completely assessed for 823 categories from three groups (Topics,
Industries, and Regions). We used the 658 categories with 25 or
more relevant documents.
      </p>
      <p>As QBD queries for each category, we selected 25 documents by
simple random sampling from that category’s positive examples.
The definition of relevance for each query was membership in the
category, and thus richness was simply the proportion of the
collection labeled with that category. Document vectors were prepared
using the original XML version of each document2. We extracted
text from the title, headline, dateline, and text subelements,
concatenating them (separated by whitespace) for input to tokenization
(discussed below).</p>
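        <p>The extraction step just described can be sketched with Python's standard library (the subelement names follow the description above; the helper itself is ours, not the paper's code):</p>

```python
import xml.etree.ElementTree as ET

def extract_text(newsitem_xml):
    """Concatenate the title, headline, dateline, and text subelements
    of one RCV1-v2 newsitem, separated by whitespace, as input to
    tokenization."""
    root = ET.fromstring(newsitem_xml)
    parts = []
    for tag in ("title", "headline", "dateline", "text"):
        el = root.find(tag)
        if el is not None:
            parts.append("".join(el.itertext()).strip())
    return " ".join(parts)
```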
      <p>No stop words, stemming, phrase formation, or other linguistic
preprocessing was used. This reflected our interest in applying QBD
techniques in machine learning systems that may not include a
broad range of text analysis options.</p>
    </sec>
    <sec id="sec-5">
      <title>Software</title>
      <p>We present ad hoc retrieval results produced using two open source
software packages. One was Elasticsearch, a distributed search
engine built on top of Lucene. We used version 6.2.2 which
incorporates Lucene 7.2.1. Our second set of results was produced using
version 0.19.1 of scikit-learn, an open source package for machine
learning, along with our modifications to that code.</p>
      <p>Some care was required to compare results from these two
systems. Supervised learning software is designed to apply a model to
every object of interest (documents for us). Every document gets
a score, but the system is silent on ranking and tiebreaking. Search
software, on the other hand, guarantees a ranking (total order) for
retrieved documents, using implicit tiebreaking when scores are
tied. However, only a subset of documents may get scores, and even
fewer may be retrieved and ranked.</p>
      <sec id="sec-5-1">
        <title>2http://trec.nist.gov/data/reuters/reuters.html</title>
        <p>To allow comparison, we forced Elasticsearch to retrieve all
documents that had any term in common with a query, by setting
the index.max_result_window to a number that is larger than the
size of the collection. We output the resulting scores and document IDs,
then assigned a score of 0 to all unretrieved documents. For
scikitlearn, we took the dot product of each document vector with the
query vector, thus producing a score (possibly 0) for each document.
Then for both Elasticsearch and scikit-learn runs, all documents
were sorted on score, with ties broken using the MD5 hash of the
document ID. Elasticsearch’s implicit tiebreaking was not used. The
total orderings were input to our evaluation routines.</p>
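        <p>The scoring-and-tiebreaking scheme above can be sketched as follows (a simplified version of our pipeline; the document IDs are illustrative):</p>

```python
import hashlib

def rank_documents(scores):
    """Sort doc IDs by descending score; break ties deterministically
    with the MD5 hash of the document ID, so Elasticsearch and
    scikit-learn runs are ranked by exactly the same rule.
    Unretrieved documents must already carry a score of 0.0."""
    def sort_key(doc_id):
        return (-scores[doc_id], hashlib.md5(doc_id.encode()).hexdigest())
    return sorted(scores, key=sort_key)

ranking = rank_documents({"d1": 0.0, "d2": 2.5, "d3": 0.0, "d4": 2.5})
```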
        <p>Our experiment framework and scripts are published on GitHub3 for
replicability.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Evaluation</title>
      <p>
        A single run applied an ad hoc retrieval algorithm on a QBD query
to produce a total ordering of the collection. Since the query itself
is a document in the collection, we used residual collection
evaluation [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], omitting the query document from the ranking before
evaluation. Thus each query is evaluated on a collection of 804,413
documents from which only itself is omitted.
      </p>
      <p>We chose residual precision @ rank k (P @k), for values of k from
1 to 20, as the primary effectiveness measure. This reflects the use of
QBD in interactive systems, including iterative relevance feedback
approaches to active learning, where the top of the ranking is of
primary concern. We also computed residual R-precision, i.e., P @k
where k is one less than the number of relevant documents for that
category. The latter measures the ability of a method to achieve
high recall with QBD.</p>
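      <p>A minimal sketch of the two measures (our notation; the ranking is assumed to already omit the query document):</p>

```python
def precision_at_k(ranking, relevant, k):
    """Residual P@k: fraction of the top k ranked documents that are
    relevant. The ranking must already omit the query document."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def r_precision(ranking, relevant, n_relevant):
    """Residual R-precision: P@k with k one less than the category's
    number of relevant documents (the query document is omitted)."""
    return precision_at_k(ranking, relevant, n_relevant - 1)
```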
      <p>Our interest was less in any particular query document than
in the overall difficulty of QBD on that category. Therefore, for
each value of k, we averaged the P @k values across the 25 query
documents for each category to get a category-level average.</p>
      <p>Then to summarize the impact of richness on effectiveness, we
further took the mean of category-level average effectiveness across
groups of categories with similar richness. These richness bins were
formed by rounding the logarithm (base 2) of richness to the nearest
integer, and grouping together all categories that rounded to the
same integer. The frequencies of our categories ranged from 0.465
(bin number -1) to our enforced lower cutoff of 25/804414 = 0.00003
(bin number -15). To ensure a minimum of 50 categories per bin,
bins -1 to -6 were combined into a single bin, as were bins -14 and
-15.</p>
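      <p>The binning scheme can be sketched as follows (labeling each merged bin by its least negative member is our choice):</p>

```python
import math

def richness_bin(richness):
    """Round log2(richness) to the nearest integer, then merge the
    sparse extremes: bins -1..-6 become one bin, as do bins -14..-15."""
    b = round(math.log2(richness))
    b = min(b, -6)    # merge the high-richness bins -1..-6
    b = max(b, -14)   # merge the low-richness bins -14..-15
    return b
```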
    </sec>
    <sec id="sec-7">
      <title>ELASTICSEARCH EXPERIMENTS</title>
      <p>Elasticsearch is widely used, open source, and provides explicit
support for QBD. It was therefore a natural choice for our first
experiments.</p>
    </sec>
    <sec id="sec-8">
      <title>Retrieval Methods</title>
      <p>Elasticsearch supports QBD through its More Like This (MLT)
option. MLT converts a query document to a disjunctive (OR) query
using (by default) up to 25 terms from the query document. Also
by default, a term must occur at least twice in the query document
to be selected, and must occur in at least five documents in the
collection (including the query document). Terms that satisfy these
criteria are ranked by TFxIDF value (see the description of vector
space retrieval below) and by default the top 25 are selected. We
retained all default settings in our experiments.</p>
      <sec id="sec-8-1">
        <title>3https://github.com/eugene-yang/DESIRES18-QBD-Experiments</title>
      </sec>
      <sec id="sec-8-2">
        <title>Figure 1: A portion of an RCV1-v2 document (USA: U.S. weekly cash lumber review - April 4).</title>
        <p>Random Lengths Gross List Cash Lumber Prices Quotes: (2x4
Std&amp;Btr) WEEKLY MIDWEEK PREV WK YR AGO Inland
Hem-Fir 450 440 435 - Southern Pine Westside 470 470 470
Western Spruce-Pine-Fir 390 383 372 - Framing Lumber
Composite 441 - 432 354 COMMENT - Trading started the
week on a fairly active note and then picked up steam when
it became clear there was no Canadian "wall of wood"
waiting to cross the border at the start of a new quota
shipments year on April 1, Random Lengths said. Prices
slanted upward, with momentum gaining toward week’s end.
Apart from a nasty snowstorm in the Northeast, improving
weather across the country boosted outbound shipments
from dealer yards. Many buyers expressed surprise, and a
little frustration, with how quickly mills cleaned up floor
stock and extended order files to mid-April or beyond. While
dealers stuck mostly to highly specified truckload purchases,
wholesalers and distributors showed more willingness to own
wood. Secondaries who took small long positions turned
them fairly quickly as some dealers scrambled to cover needs
they had delayed purchasing earlier. The need for stock put a
premium on prompt shipments, favoring distribution yards
and reloads. ...</p>
        <p>For comparison, we also formed disjunctive queries from query
documents by taking the OR of all terms in the query, and executed
those queries using the Simple Query String (SQS) option. The
only difference between the MLT and SQS runs was that the SQS
runs used all terms in the query document, while the MLT runs used
a subset of those terms. Figure 1 shows a portion of one
RCV1-v2 document. The corresponding full document retrieves 803,940
documents when a disjunctive query is formed from all terms, but
only 204 documents when MLT is used to produce a reduced query.</p>
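        <p>For concreteness, the two query forms can be sketched as Elasticsearch query bodies (the field name, index name, and document ID are hypothetical; the more_like_this values shown are the Elasticsearch defaults described above):</p>

```python
# more_like_this: reduced query built from a query document.
mlt_query = {
    "query": {
        "more_like_this": {
            "fields": ["text"],
            "like": [{"_index": "rcv1", "_id": "some-doc-id"}],
            "max_query_terms": 25,  # keep at most 25 terms (default)
            "min_term_freq": 2,     # term occurs twice in the query doc
            "min_doc_freq": 5,      # and in at least 5 collection docs
        }
    }
}

# simple_query_string: pure disjunction (OR) of all query-document terms.
sqs_query = {
    "query": {
        "simple_query_string": {
            "query": "full text of the query document goes here",
            "fields": ["text"],
            "default_operator": "or",
        }
    }
}
```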
        <p>
          MLT and SQS queries both support ranking the retrieved
documents using any of several ad hoc retrieval methods. We tried one
version each of Okapi BM25 probabilistic retrieval [
          <xref ref-type="bibr" rid="ref38 ref39">38, 39</xref>
          ], vector
space retrieval [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ], and language model based retrieval [
          <xref ref-type="bibr" rid="ref45 ref50 ref51">45, 50, 51</xref>
          ].
        </p>
        <p>
          For BM25 a document is a vector of saturated TF weights
produced by a function that incorporates document length
normalization [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]. We used the default Elasticsearch values of b=0.75
and k1=1.2 for the BM25 parameters. A BM25 query is a vector
of probabilistic-style IDF weights, optionally multiplied by within
query weights [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]. Elasticsearch uses raw query term frequency
weighting by default and we retained that.
        </p>
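        <p>A sketch of the per-term score implied by the description above (our paraphrase of the BM25 formulas in [36], using the Elasticsearch default parameters; not Elasticsearch's exact code):</p>

```python
import math

def bm25_term_score(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Probabilistic-style IDF times a saturated, length-normalized TF.
    tf: term frequency in the document; df: document frequency of the
    term; n_docs: collection size."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + k1 * norm)
```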
        <p>For VSM retrieval, document vectors are produced by
multiplying a TF (within document term frequency) component and an
IDF component for each term, and then normalizing for document
length. We used raw term frequency, the smoothed IDF version
provided by Elasticsearch (Section 6.2), and L2 document length
normalization. L2 normalization (also known as cosine
normalization in information retrieval) divides the TFxIDF weights by the
square root of the sum of their squares, giving a Euclidean (L2)
norm of 1.0 for all document vectors. This is the classic retrieval
model in Elasticsearch.</p>
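        <p>The cosine (L2) normalization step can be sketched as:</p>

```python
import math

def l2_normalize(weights):
    """Divide each TFxIDF weight by the Euclidean norm of the vector,
    so every document vector has L2 norm 1.0."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}
```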
        <p>For LM based retrieval we used Dirichlet smoothing with µ =
2000, the default value from Elasticsearch.4</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Results and Discussion</title>
      <p>Table 1 shows the macroaveraged P@5 and R-Precision values for
the retrieval models and query variants in Elasticsearch.</p>
      <p>For every richness bin and retrieval model, queries based on a
subset of terms (Elasticsearch MLT) were no better (and usually
worse) than using all terms from the QBD document in the query.
This contradicts assumptions in the QBD literature that query
reduction is critical for good efectiveness. Admittedly, the query
reduction method implemented in Elasticsearch is less
sophisticated than some proposed in the research literature, and may be
motivated more by its impact on eficiency than on efectiveness.
That said, Elasticsearch is widely used, often with defaults
unexamined and unchanged.</p>
      <p>Both VSM and LM retrieval are based on notions of query/document
similarity, with the latter treating documents as probability
distributions from which the query might be generated. A view of retrieval
as similarity might seem natural for QBD, and even more so in our
simulated setting where query documents are selected at random
rather than by a user with intention.</p>
      <p>It is notable, therefore, that Elasticsearch’s BM25 default model
dominates its VSM and LM default models for almost all richness
conditions and both query forms. There have been numerous
published language model variants, and it is plausible that one would
do better than BM25 on our dataset, and on QBD in general. But our
results, at least, cut against simplistic notions that QBD is simply
similarity matching.</p>
      <p>Our most striking result was the sharp, nearly monotonic decline
in absolute effectiveness with declining richness for all retrieval
models and both query types. The monotonic decrease in
effectiveness appears not only for the recall-oriented R-precision metric,
but even for P @5 (and all choices of P @k for k = 1 to 20).</p>
      <p>Figure 2 shows a scatter plot of richness vs P @20 for BM25
similarity using the full disjunctive (SQS) query on all 658
categories. The plot shows the strong correlation between richness
and effectiveness, though also a substantial category-to-category
variation.</p>
      <p>
        The Pearson (linear) correlation between the base 2 logarithm of
richness and residual P@20 is 0.58.5 For comparison, the maximum
Pearson correlation of 22 query effectiveness prediction methods
studied by Hauff, Hiemstra, and de Jong was 0.52 across a set of
datasets [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Richness with respect to a query is of course not
known in operational settings, so richness is not a practical
pre-retrieval effectiveness prediction cue. What this comparison does
show is the power of richness as a confounding factor in
studying effectiveness, compared to other query characteristics that are
commonly viewed as important.
      </p>
      <sec id="sec-9-1">
        <title>4https://www.elastic.co/blog/language-models-in-elasticsearch</title>
        <p>5We used P @20 instead of P @5 for increased granularity in the figure, but correlations
are similar at all depths.</p>
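        <p>The correlation reported above is an ordinary Pearson coefficient over per-category (log2 richness, P@20) pairs; a self-contained sketch:</p>

```python
import math

def pearson(xs, ys):
    """Pearson (linear) correlation between two equal-length sequences,
    e.g. log2 richness vs. residual P@20 across categories."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```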
        <p>
          Most IR researchers and practitioners would agree that low
richness makes retrieval more difficult. Yet, the routine success of web
search engines at very low richness tasks has perhaps led to a
certain complacency. It is notable that overviews of the TREC HARD
track (which focused on retrieval for difficult ad hoc queries) do not
even mention richness as a contributor to low effectiveness [
          <xref ref-type="bibr" rid="ref2 ref3 ref4">2–4</xref>
          ].
        </p>
        <p>Richness also affects relative effectiveness, though less strongly.
The rank ordering of the six methods is largely the same across
richness bins. However, the relative difference between the best
and worst P@5 scores for the six methods increases from 6% for
the highest richness bin to 15% for the lowest richness bin. Thus,
as richness decreases the choice of retrieval method becomes more
consequential. This has obvious implications for trading off cost
versus complexity in operational systems.</p>
        <p>We focused on P @5 in our analysis to reflect our interest in QBD
for interactive interfaces. Five documents is close to the maximum
within which a user can immediately perceive the presence of
relevant documents. The P @k results for all depths from 1 to 20
follow a similar pattern, however, with relative differences being
more stable across richness bins as k increases. The R-precision
results, which correspond to P @k for k equal to the number of
relevant documents, are a limiting case.</p>
        <p>Are the differences discussed here statistically significant? Claims
of statistical significance in ad hoc retrieval experiments are
typically based on the problematic assumption that the queries in a
test collection are a random sample from some population. Outside
of query log studies, this is always false. It is particularly false for
the RCV1-v2 categories, which are part of a single indexing system.
Thus, despite the fact that each value in Table 1 is based on more
than 1000 data points, we eschew such claims. We believe
convincing evidence for analyses such as ours requires replication across
different datasets and different experiment designs.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Comparison with LYRL2004 Results</title>
      <p>
        The 2004 paper by Lewis, Yang, Rose, and Li introducing the
RCV1-v2 collection includes graphs showing that text classification
effectiveness generally increases with increasing category richness
[
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. That analysis, however, was based on classifiers produced
by applying supervised learning to a training set of 23,149
documents. Category richness strongly affected the number of positive
examples present for a given category within that fixed training
set.
      </p>
      <p>Thus the RCV1-v2 results conflated the quality of the training
data available for a category with the difficulty of the classification
task for that category. In contrast, our results are based on making
exactly the same amount of data (one positive document) available
for each run on each category.</p>
      <p>
        The impact of richness was also obscured in the RCV1-v2 paper
by its focus on binary text classifiers evaluated by F1 (harmonic
mean of recall and precision). For categories with high richness,
large values of F1 can be achieved simply by classifying all test
examples as positive. For RCV1-v2, the highest richness category
would get an F1 score of 0.635 under this strategy [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. Even random
classification of documents will produce a nontrivial value of F1
for high richness categories. This lower bounding of effectiveness
by trivial approaches reflects the fact that richness is, for practical
purposes, a floor on precision [
        <xref ref-type="bibr" rid="ref37 ref42">37, 42</xref>
        ].
      </p>
      <p>
        Our experimental design was not immune to this generality
effect [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ], but minimized it to the extent possible. We ensured
that effectiveness values in the full range from 0.0 to 1.0 were
logically possible for all queries, categories, and measures. Thus
neither absolute nor relative effects of richness result from ceiling or
floor effects. Our use of ranking-based effectiveness measures means
there is no analogue to a trivial system that treats all examples as
relevant. A trivial system that randomly ordered documents would have
an expected precision of 0.465 for the most frequent category. However,
for most categories the expected precision of such a system would be
well below 0.01.
      </p>
    </sec>
    <sec id="sec-11">
      <title>REIMPLEMENTATION IN SCIKIT-LEARN</title>
      <p>A major impetus for our work is the hope of using QBD
methods to improve first round effectiveness for active learning of text
classifiers. As a first step, we reimplemented Elasticsearch’s BM25
variant in scikit-learn, a machine learning toolkit.</p>
      <p>We first created a matrix of raw term frequency values using
CountVectorizer from scikit-learn6. We then extended an existing
BM25 implementation7 and created BM25 query and document
vectors intended to be identical to those from Elasticsearch.</p>
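      <p>The term-frequency step can be sketched as follows (the two-document corpus is illustrative; it shows only that CountVectorizer yields raw counts):</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "lumber prices rose",
    "lumber shipments rose sharply",
]
vectorizer = CountVectorizer(lowercase=True)
tf = vectorizer.fit_transform(corpus)  # sparse doc-by-term count matrix
```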
      <p>The first two lines of Table 2 compare the Elasticsearch BM25
results using all query document terms to those from our first
scikit-learn implementation. To our surprise, the Elasticsearch results were
notably different, and often better.</p>
      <sec id="sec-11-1">
        <title>6http://scikit-learn.org/stable/modules/generated/</title>
        <p>sklearn.feature_extraction.text.CountVectorizer.html
7 https://github.com/scikit-learn/scikit-learn/pull/6973</p>
        <p>[Table 2: Residual P@5 and R-precision by richness bin for
Elasticsearch (Simple Query String) BM25 and the scikit-learn
variants: Original Result (With IDF and Raw QTF); Same Param.;
Same Param. + Sklearn IDF Smoothing; Same Param. + ES IDF
Smoothing; Same Param. + ES IDF Smoothing + Token.]</p>
        <p>We noticed that Elasticsearch used b=0.75 and k1=1.2 as its
defaults, but b=0.75 and k1=2.0 were the defaults in the BM25
implementation we built on. We changed our values in scikit-learn to
the Elasticsearch ones, with results shown in the row marked
“Same Param.”. Our results were closer to those of Elasticsearch
in most richness bins, but diverged slightly more in the lowest
richness bin.</p>
        <p>IDF</p>
      </sec>
      <sec id="sec-11-2">
        <title>6.2 IDF</title>
        <p>The standard definition of IDF weighting [36] is log(N / n_j),
where N is the number of documents in the collection, and n_j is
the number of documents that term j occurs in. Logarithms base 2,
e, and 10 are all commonly used with IDF, with the choice having
no effect in typical TFxIDF term weighting applications.</p>
        <p>The scikit-learn BM25 implementation that served as the
starting point for our scikit-learn work used scikit-learn’s built-in IDF
weighting, which is controlled by the boolean flag smooth_idf⁸.
Our original scikit-learn BM25 had IDF smoothing turned off. We
tried a run (row: “Same Param. + Sklearn IDF Smoothing”) with
IDF smoothing turned on, and the results were closer to, but not
identical to, those of Elasticsearch.</p>
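<p>The effect of the flag can be checked directly against scikit-learn’s TfidfTransformer; the toy count matrix below is invented for illustration.</p>

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy counts: 3 documents, 2 terms; term 0 occurs in all 3 docs, term 1 in one.
counts = np.array([[1, 0], [1, 0], [1, 1]])
N = counts.shape[0]
df = np.array([3, 1])  # document frequencies n_j

smoothed = TfidfTransformer(smooth_idf=True, norm=None).fit(counts)
unsmoothed = TfidfTransformer(smooth_idf=False, norm=None).fit(counts)

# smooth_idf=True:  1 + ln((N + 1) / (n_j + 1))
assert np.allclose(smoothed.idf_, 1 + np.log((N + 1) / (df + 1)))
# smooth_idf=False: 1 + ln(N / n_j)
assert np.allclose(unsmoothed.idf_, 1 + np.log(N / df))
```

<p>Note that with either setting a term occurring in every document still receives a weight of 1, not 0, because of the added constant.</p>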
        <p>Disabling smooth_idf in scikit-learn uses this version of IDF:
1 + ln(N / n_j)
while enabling the option uses this version:
1 + ln((N + 1) / (n_j + 1))</p>
      </sec>
      <sec id="sec-11-3">
        <title>Footnote</title>
        <p>8 http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html</p>
        <p>Neither of these is the standard definition of IDF weighting. Nor
is either what the above scikit-learn documentation states is the
"textbook" definition:
log(N / (n_j + 1))
in what appears to be a typographical error.</p>
        <p>
          Elasticsearch implements yet a fifth version, the one suggested
by the probabilistic derivation of BM25 [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]:
ln((N - n_j + 0.5) / (n_j + 0.5))
        </p>
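<p>For reference, the five IDF formulations under discussion can be put side by side; the helper function and its names are hypothetical, not part of either library, and natural logarithms are used throughout.</p>

```python
import math

def idf_variants(N, n_j):
    """The five IDF formulations discussed in the text (natural log throughout)."""
    return {
        "standard":      math.log(N / n_j),                        # log(N / n_j)
        "sklearn_off":   1 + math.log(N / n_j),                    # smooth_idf disabled
        "sklearn_on":    1 + math.log((N + 1) / (n_j + 1)),        # smooth_idf enabled
        "sklearn_doc":   math.log(N / (n_j + 1)),                  # docs' "textbook" form
        "elasticsearch": math.log((N - n_j + 0.5) / (n_j + 0.5)),  # BM25 / RSJ form
    }

# For a collection of 100,000 documents and a term occurring in 1,000 of them:
vals = idf_variants(100_000, 1_000)
```

<p>One practical difference worth noting: the BM25/RSJ form goes negative for terms occurring in more than half the documents, while the scikit-learn variants never do.</p>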
        <p>We added this variant to our scikit-learn implementation, and
got the results shown in row “Same Param. + ES IDF Smoothing”.</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>6.3 Tokenization</title>
      <p>At this point we had very similar effectiveness from the two
implementations, but decided to shoot for an exact replication. The
tokenizer in scikit-learn separates a character string into words at
boundaries between Unicode word and non-word characters by
applying regular expressions. Elasticsearch tokenizes text by
applying the Unicode Text Segmentation algorithm specified in Unicode
Standard Annex #29⁹.</p>
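<p>The difference is easy to see on a small example. The sample string below is ours, and the Elasticsearch behavior noted in the comment is paraphrased from UAX #29 rather than produced by this code.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "doesn't re-rank U.S. data"

# scikit-learn's default token_pattern, r"(?u)\b\w\w+\b", splits at
# word/non-word boundaries and silently drops single-character tokens.
tokens = CountVectorizer().build_tokenizer()(text)
# tokens == ['doesn', 're', 'rank', 'data']; a UAX #29 tokenizer such as
# Elasticsearch's standard analyzer would, e.g., keep "doesn't" as one token.
```

<p>Such small tokenization differences change both term frequencies and document frequencies, and hence the final BM25 scores.</p>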
      <p>We therefore extracted the tokens and raw TF values for each
document from Elasticsearch via the Term Vector API. We used this
data to create a raw TF matrix in scikit-learn, and created BM25
query and document vectors as above. Surprisingly, the results,
marked “Same Param. + ES IDF Smoothing + Token”, still differed
slightly from the Elasticsearch results.</p>
    </sec>
    <sec id="sec-13">
      <title>6.4 Document Length</title>
      <p>At this point we had identical raw TF vectors and identical
weighting formulas. Yet when we examined individual document scores</p>
      <sec id="sec-13-1">
        <title>Footnote</title>
        <p>9 http://unicode.org/reports/tr29/</p>
        <p>from corresponding runs, many were slightly diferent. Using
Elasticsearch’s explain API10 uncovered the diference.</p>
        <p>Elasticsearch inherits from Lucene a lossy compression of
document lengths¹¹. All document lengths from 0 to 2,013,265,944 are
compressed into 7 bits using a coding table¹². Elasticsearch’s BM25
implementation uses these approximate lengths. We verified for
individual documents that correcting for this removes the last
difference between the Elasticsearch and scikit-learn scores. (We chose
not to implement this quirky scheme in our scikit-learn code.)</p>
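<p>The idea behind the compression can be sketched generically: lengths are snapped to one of 128 representable values, so distinct nearby lengths become indistinguishable to the scorer. The table below is invented for illustration and is not Lucene’s actual coding table.</p>

```python
import bisect

# Up to 128 increasing representable lengths (illustrative geometric spacing;
# Lucene's real coding table differs).
TABLE = sorted({int(1.23 ** i) for i in range(128)})

def encode(length):
    """Map a document length to a 7-bit code: index of largest entry not above it."""
    return max(0, bisect.bisect_right(TABLE, length) - 1)

def decode(code):
    """Recover the approximate length that the code stands for."""
    return TABLE[code]

# Round-tripping is lossy: nearby lengths collapse onto one representable value.
approx = decode(encode(1000))
```

<p>A BM25 implementation that scores with decode(encode(length)) instead of the true length will, as we observed, produce slightly different scores than one using exact lengths.</p>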
      </sec>
    </sec>
    <sec id="sec-14">
      <title>7 FUTURE WORK</title>
      <p>
        Querying with documents lives intriguingly at the intersection of
ad hoc retrieval and supervised learning. Our results suggest that
QBD using the ad hoc retrieval workhorse BM25 is a solid approach.
BM25 can be viewed as Naive Bayes combined with negative blind
feedback [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]; so our results reinforce the usefulness of generative
models with tiny training sets [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. We plan next to compare the
downstream effects on active learning of using BM25 on the first
round, and to test schemes for transitioning from BM25 to
discriminative supervised learning once enough training data has been
accumulated. The fact that Naive Bayes and logistic regression can
be viewed as, respectively, generative and discriminative variants
of the same model [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] may prove useful for this purpose.
      </p>
      <p>The strong impact of richness on absolute and relative
effectiveness is intriguing and cries out for study on multiple datasets.
Immediate questions are 1) whether the effect is an artifact of using
random examples as simulated QBD queries or of using text
categorization topics as simulated user interests; 2) whether it carries
beyond QBD to ad hoc retrieval with short queries; and 3) whether
it is a function of simple frequency, topical generality, or both. Some
of these factors can be explored with existing datasets, but others
may require new test collection work.</p>
    </sec>
    <sec id="sec-15">
      <title>8 CONCLUSION</title>
      <p>Our interest in QBD was sparked by the problem of supervised
learning on tiny training sets. However, using a document as a
query is also a widely supported information access tool and, to our
mind, an understudied one. BM25 weighting turns out to be an
effective approach, and its origins in supervised learning (naive Bayes)
suggest interesting approaches for kicking off active learning.</p>
      <p>The assumption in the QBD literature that query pruning is
crucial was not borne out by our work. Indeed, taken literally,
pruning cannot actually be necessary: one can always achieve the
same effect with sufficiently small weights, and this is perhaps a
more natural perspective in supervised learning contexts. To the
extent that pruning is desirable for efficiency reasons in inverted
file retrieval, it should perhaps be treated as a query optimization
issue, not a modeling one.</p>
      <p>The effort required to replicate the results of an open source
search engine using an open source machine learning toolkit was a
reminder of the range of factors that impact the effectiveness of text
processing systems. Our experience provides yet more evidence
that care is needed in interpreting small differences in published
retrieval results, and that ongoing attention is needed to replicability
in IR research.</p>
      <p>10 https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html
11 https://github.com/elastic/elasticsearch/issues/24620
12 https://issues.apache.org/jira/browse/LUCENE-7730</p>
      <p>Finally, the overwhelming impact of richness on effectiveness
in our experiments was both intriguing and unsettling. Suppose
such an effect were to hold not just for our simulation using a text
categorization data set, but also for widely used ad hoc retrieval
test collections. This would raise the possibility that the outcomes
of many past experiments on ad hoc retrieval were predestined by
test collection design choices.</p>
      <p>Further, we would have no way to tell if this were true. With
the exception of some work in e-discovery, all major public test
collections have at least partially decoupled true richness from
actual topical richness, leaving topical richness unknown and not
measurable after the fact. We suggest that organizers of future IR
evaluations consider explicit control and measurement of richness
in test collection creation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>IJsbrand</given-names>
            <surname>Jan Aalbersberg</surname>
          </string-name>
          .
          <year>1992</year>
          .
          <article-title>Incremental relevance feedback</article-title>
          .
          <source>In SIGIR 1992. ACM</source>
          ,
          <volume>11</volume>
          -
          <fpage>22</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>James</given-names>
            <surname>Allan</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>HARD track overview in TREC 2003 high accuracy retrieval from documents</article-title>
          .
          <source>In TREC</source>
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>James</given-names>
            <surname>Allan</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>HARD track overview in TREC 2004 high accuracy retrieval from documents</article-title>
          .
          <source>In TREC</source>
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>James</given-names>
            <surname>Allan</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>HARD track overview in TREC 2005 high accuracy retrieval from documents</article-title>
          .
          <source>In TREC</source>
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Mossaab</given-names>
            <surname>Bagdouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>William</given-names>
            <surname>Webber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David D</given-names>
            <surname>Lewis</surname>
          </string-name>
          , and Douglas W Oard.
          <year>2013</year>
          .
          <article-title>Towards minimizing the annotation cost of certified text classification</article-title>
          .
          <source>In CIKM 2013. ACM</source>
          ,
          <volume>989</volume>
          -
          <fpage>998</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Jason</surname>
            <given-names>R Baron</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Michael D</given-names>
            <surname>Berman</surname>
          </string-name>
          , and Ralph C Losey.
          <year>2016</year>
          .
          <article-title>Perspectives on Predictive Coding and Other Advanced Search Methods for the Legal Practitioner</article-title>
          . ABA Book Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Michael</surname>
            <given-names>S Bernstein</given-names>
          </string-name>
          , Bongwon Suh, Lichan Hong, Jilin Chen, Sanjay Kairam, and Ed H Chi.
          <year>2010</year>
          .
          <article-title>Eddi: interactive topic-based browsing of social status streams</article-title>
          .
          <source>In UIST 2010. ACM</source>
          ,
          <volume>303</volume>
          -
          <fpage>312</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Abdur</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          , Ophir Frieder, David Grossman, and
          <string-name>
            <surname>Mary Catherine McCabe</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Collection statistics for fast duplicate document detection</article-title>
          .
          <source>TOIS 20</source>
          ,
          <issue>2</issue>
          (
          <year>2002</year>
          ),
          <fpage>171</fpage>
          -
          <lpage>191</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Gordon V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>Maura R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Evaluation of machine-learning protocols for technology-assisted review in electronic discovery</article-title>
          .
          <source>SIGIR</source>
          <year>2014</year>
          (
          <year>2014</year>
          ),
          <fpage>153</fpage>
          -
          <lpage>162</lpage>
          . https://doi.org/10.1145/2600428.2609601.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Gordon V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          and
          <string-name>
            <given-names>Maura R.</given-names>
            <surname>Grossman</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Autonomy and reliability of continuous active learning for technology-assisted review</article-title>
          .
          <source>arXiv</source>
          (
          <year>2015</year>
          ),
          <fpage>19</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W.B.</given-names>
            <surname>Croft</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.J.</given-names>
            <surname>Harper</surname>
          </string-name>
          .
          <year>1979</year>
          .
          <article-title>Using Probabilistic Models of Document Retrieval without Relevance Information</article-title>
          .
          <source>JDoc 35</source>
          ,
          <issue>4</issue>
          (
          <year>1979</year>
          ),
          <fpage>282</fpage>
          -
          <lpage>295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Jonathan</surname>
            <given-names>T</given-names>
          </string-name>
          <string-name>
            <surname>Foote</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Content-based retrieval of music and audio</article-title>
          .
          <source>In Multimedia Storage and Archiving Systems II</source>
          , Vol.
          <volume>3229</volume>
          .
          <source>International Society for Optics and Photonics</source>
          ,
          <volume>138</volume>
          -
          <fpage>148</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Atsushi</surname>
            <given-names>Fujii</given-names>
          </string-name>
          , Makoto Iwayama, and Noriko Kando. [n. d.].
          <source>Overview of the Patent Retrieval Task at the NTCIR-6 Workshop</source>
          ..
          <source>In NCTIR</source>
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Julien</surname>
            <given-names>Gobeill</given-names>
          </string-name>
          , Douglas Theodoro, and Patrick Ruch. [n. d.].
          <article-title>Exploring a Wide Range of Simple Pre and Post Processing Strategies for Patent Searching in CLEF IP 2009.</article-title>
          . In CLEF (Working Notes).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Maura</surname>
            <given-names>R Grossman</given-names>
          </string-name>
          , Gordon V Cormack,
          <string-name>
            <given-names>and Adam</given-names>
            <surname>Roegiest</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>TREC 2016 Total Recall Track Overview</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Manish</surname>
            <given-names>Gupta</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Bendersky</surname>
          </string-name>
          , et al.
          <year>2015</year>
          .
          <article-title>Information retrieval with verbose queries</article-title>
          .
          <source>F&amp;T in IR 9</source>
          ,
          <issue>3</issue>
          -
          <fpage>4</fpage>
          (
          <year>2015</year>
          ),
          <fpage>209</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Donna</given-names>
            <surname>Harman</surname>
          </string-name>
          .
          <year>1992</year>
          . Information Retrieval. Prentice-Hall, Inc., Upper Saddle River, NJ, USA,
          <source>Chapter Relevance Feedback and Other Query Modification Techniques</source>
          ,
          <fpage>241</fpage>
          -
          <lpage>263</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Donna</surname>
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Harman</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>The TREC Test Collections</article-title>
          . In TREC:
          <article-title>Experiment and Evaluation in Information Retrieval</article-title>
          . MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Claudia</given-names>
            <surname>Hauff</surname>
          </string-name>
          , Djoerd Hiemstra, and Franciska de Jong.
          <year>2008</year>
          .
          <article-title>A survey of preretrieval query performance predictors</article-title>
          .
          <source>In CIKM 2008. ACM</source>
          ,
          <volume>1419</volume>
          -
          <fpage>1420</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Haibo</given-names>
            <surname>He and Edwardo A Garcia</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Learning from imbalanced data</article-title>
          .
          <source>IEEE Transactions on knowledge and data engineering 21</source>
          ,
          <issue>9</issue>
          (
          <year>2009</year>
          ),
          <fpage>1263</fpage>
          -
          <lpage>1284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Marti</given-names>
            <surname>Hearst</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Search user interfaces</article-title>
          . Cambridge University Press.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Bruce</surname>
            <given-names>Hedin</given-names>
          </string-name>
          , Stephen Tomlinson,
          <string-name>
            <surname>Jason R Baron</surname>
          </string-name>
          , and Douglas W Oard.
          <year>2009</year>
          .
          <article-title>Overview of the TREC 2009 legal track</article-title>
          .
          <source>Technical Report.</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Weiming</surname>
            <given-names>Hu</given-names>
          </string-name>
          , Nianhua Xie,
          <string-name>
            <surname>Li</surname>
            <given-names>Li</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Xianglin</given-names>
            <surname>Zeng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Maybank</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>A survey on visual content-based video indexing and retrieval</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics</source>
          , Part C (
          <article-title>Applications</article-title>
          and Reviews)
          <volume>41</volume>
          ,
          <issue>6</issue>
          (
          <year>2011</year>
          ),
          <fpage>797</fpage>
          -
          <lpage>819</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Nathalie</given-names>
            <surname>Japkowicz</surname>
          </string-name>
          and
          <string-name>
            <given-names>Shaju</given-names>
            <surname>Stephen</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>The class imbalance problem: A systematic study</article-title>
          .
          <source>Intelligent data analysis 6</source>
          ,
          <issue>5</issue>
          (
          <year>2002</year>
          ),
          <fpage>429</fpage>
          -
          <lpage>449</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Martha</surname>
            <given-names>Larson</given-names>
          </string-name>
          ,
          <source>Gareth JF Jones</source>
          , et al.
          <year>2012</year>
          .
          <article-title>Spoken content retrieval: A survey of techniques and technologies</article-title>
          .
          <source>F&amp;T in IR 5</source>
          ,
          <issue>4</issue>
          -
          <fpage>5</fpage>
          (
          <year>2012</year>
          ),
          <fpage>235</fpage>
          -
          <lpage>422</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>David</surname>
            <given-names>D</given-names>
          </string-name>
          <string-name>
            <surname>Lewis</surname>
          </string-name>
          and William A Gale.
          <year>1994</year>
          .
          <article-title>A sequential algorithm for training text classifiers</article-title>
          .
          <source>In SIGIR 1994</source>
          . Springer-Verlag New York, Inc.,
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>David</surname>
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Lewis</surname>
            and
            <given-names>Richard M.</given-names>
          </string-name>
          <string-name>
            <surname>Tong</surname>
          </string-name>
          .
          <year>1992</year>
          .
          <article-title>Text Filtering in MUC-3 and MUC-4</article-title>
          .
          <source>In Proceedings of the 4th Conference on Message Understanding</source>
          .
          <fpage>51</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>David</surname>
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>Yiming</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            , Tony G. Rose, and
            <given-names>Fan</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>RCV1: A New Benchmark Collection for Text Categorization Research</article-title>
          . JMLR
          <volume>5</volume>
          (
          <year>2004</year>
          ),
          <fpage>361</fpage>
          -
          <lpage>397</lpage>
          . https://doi.org/10.1145/122860.122861
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Chao</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Chen Chen, Jiawei Han, and
          <string-name>
            <surname>Philip S Yu</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>GPLAG: detection of software plagiarism by program dependence graph analysis</article-title>
          .
          <source>In SIGKDD 2006. ACM</source>
          ,
          <volume>872</volume>
          -
          <fpage>881</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Ying</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Dengsheng Zhang, Guojun Lu, and
          <string-name>
            <surname>Wei-Ying Ma</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>A survey of content-based image retrieval with high-level semantics</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>40</volume>
          ,
          <issue>1</issue>
          (
          <year>2007</year>
          ),
          <fpage>262</fpage>
          -
          <lpage>282</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Mihai</surname>
            <given-names>Lupu</given-names>
          </string-name>
          , Harsha Gurulingappa,
          <string-name>
            <surname>Igor</surname>
            <given-names>Filippov</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            <given-names>Jiashu</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juliane Fluck</surname>
            , Marc Zimmermann, Jimmy Huang,
            <given-names>and John</given-names>
          </string-name>
          <string-name>
            <surname>Tait</surname>
          </string-name>
          . [n. d.].
          <source>Overview of the TREC 2011 Chemical IR Track</source>
          . ([n. d.]).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>A.Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes</article-title>
          .
          <source>NIPS</source>
          (
          <year>2001</year>
          ),
          <fpage>841</fpage>
          -
          <lpage>848</lpage>
          . https://doi.org/10.1007/s11063-008-9088-7
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <surname>Andrew</surname>
            <given-names>Y</given-names>
          </string-name>
          <string-name>
            <surname>Ng</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Feature selection, L 1 vs. L 2 regularization, and rotational invariance</article-title>
          .
          <source>In 2004 ICML. ACM</source>
          ,
          <volume>78</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Florina</given-names>
            <surname>Piroi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Allan</given-names>
            <surname>Hanbury</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Evaluating Information Retrieval Systems on European Patent Data: The CLEF-IP Campaign</article-title>
          .
          <source>In Current Challenges in Patent Information Retrieval</source>
          . Springer,
          <fpage>113</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Understanding inverse document frequency: on theoretical arguments for IDF</article-title>
          .
          <source>JDoc 60</source>
          ,
          <issue>5</issue>
          (
          <year>2004</year>
          ),
          <fpage>503</fpage>
          -
          <lpage>520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hugo</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          , et al.
          <year>2009</year>
          .
          <article-title>The probabilistic relevance framework: BM25 and beyond</article-title>
          .
          <source>F&amp;T in IR 3</source>
          ,
          <issue>4</issue>
          (
          <year>2009</year>
          ),
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Stephen E</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <year>1969</year>
          .
          <article-title>The Parametric Description of Retrieval Tests: Part I: The Basic Parameters</article-title>
          .
          <source>Journal of Documentation 25</source>
          ,
          <issue>1</issue>
          (
          <year>1969</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Stephen E</given-names>
            <surname>Robertson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Steve</given-names>
            <surname>Walker</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval</article-title>
          .
          <source>In SIGIR 1994</source>
          . Springer-Verlag New York, Inc.,
          <fpage>232</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Stephen E</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Steve</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Susan</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Micheline M</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mike</given-names>
            <surname>Gatford</surname>
          </string-name>
          , et al.
          <year>1995</year>
          .
          <article-title>Okapi at TREC-3</article-title>
          .
          <volume>109</volume>
          (
          <year>1995</year>
          ),
          <fpage>109</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Joseph John</given-names>
            <surname>Rocchio</surname>
          </string-name>
          .
          <year>1971</year>
          .
          <article-title>Relevance feedback in information retrieval</article-title>
          . (
          <year>1971</year>
          ),
          <fpage>313</fpage>
          -
          <lpage>323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Gerard</given-names>
            <surname>Salton</surname>
          </string-name>
          .
          <year>1970</year>
          .
          <article-title>The “generality” effect and the retrieval evaluation for large collections</article-title>
          .
          <source>Technical Report 70-67</source>
          . Cornell University, Ithaca, NY.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Gerard</given-names>
            <surname>Salton</surname>
          </string-name>
          .
          <year>1972</year>
          .
          <article-title>The “generality” effect and the retrieval evaluation for large collections</article-title>
          .
          <source>Journal of the Association for Information Science and Technology 23</source>
          ,
          <issue>1</issue>
          (
          <year>1972</year>
          ),
          <fpage>11</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Wong</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>A Vector-Space Model for Automatic Indexing</article-title>
          .
          <source>Commun. ACM</source>
          <volume>18</volume>
          ,
          <issue>11</issue>
          (
          <year>1975</year>
          ),
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>Walid</given-names>
            <surname>Shalaby</surname>
          </string-name>
          and
          <string-name>
            <given-names>Wlodek</given-names>
            <surname>Zadrozny</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Patent Retrieval: A Literature Review</article-title>
          .
          <source>arXiv preprint arXiv:1701.00324</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>Fei</given-names>
            <surname>Song</surname>
          </string-name>
          and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>A general language model for information retrieval</article-title>
          .
          <source>In CIKM 1999. ACM</source>
          ,
          <fpage>316</fpage>
          -
          <lpage>321</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>Amanda</given-names>
            <surname>Spink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bernard J</given-names>
            <surname>Jansen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H Cenk</given-names>
            <surname>Ozmultu</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Use of query reformulation and relevance feedback by Excite users</article-title>
          .
          <source>Internet research 10</source>
          ,
          <issue>4</issue>
          (
          <year>2000</year>
          ),
          <fpage>317</fpage>
          -
          <lpage>328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>Suzan</given-names>
            <surname>Verberne</surname>
          </string-name>
          and
          <string-name>
            <given-names>Eva</given-names>
            <surname>D'hondt</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Prior art retrieval using the claims section as a bag of words</article-title>
          .
          <source>In CLEF 2009</source>
          . Springer,
          <fpage>497</fpage>
          -
          <lpage>501</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grossman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Frieder</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Yurchak</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Effectiveness Results for Popular e-Discovery Algorithms</article-title>
          .
          <source>In ICAIL 2017</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>Yin</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nilesh</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wisam</given-names>
            <surname>Dakka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Panagiotis</given-names>
            <surname>Ipeirotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nick</given-names>
            <surname>Koudas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dimitris</given-names>
            <surname>Papadias</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Query by document</article-title>
          .
          <source>In WSDM 2009. ACM</source>
          ,
          <fpage>34</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>ChengXiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Statistical language models for information retrieval</article-title>
          .
          <source>Synthesis Lectures on Human Language Technologies</source>
          <volume>1</volume>
          ,
          <issue>1</issue>
          (
          <year>2008</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>ChengXiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          and
          <string-name>
            <given-names>John</given-names>
            <surname>Lafferty</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Two-stage language models for information retrieval</article-title>
          .
          <source>In SIGIR 2002. ACM</source>
          ,
          <fpage>49</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>Moshé M</given-names>
            <surname>Zloof</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>Query by example</article-title>
          .
          <source>In Proceedings of the May 19-22, 1975, National Computer Conference and Exposition</source>
          . ACM,
          <fpage>431</fpage>
          -
          <lpage>438</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>