           A Distributed Effort Approach
              for Systematic Reviews.
       IMS Unipd at CLEF 2019 eHealth Task 2.

                              Giorgio Maria Di Nunzio1,2
                           Department of Information Engineering
                                Department of Mathematics
                                University of Padua, Italy

        Abstract. This is the third participation of the Information Management
        Systems (IMS) group in the CLEF eHealth task on Technology Assisted
        Reviews in Empirical Medicine. This task focuses on medical systematic
        reviews, a problem that requires a recall close (if not equal) to 100%.
        Semi-automated approaches are essential to support this type of search
        when the amount of data exceeds the limits of users, i.e. in terms of
        attention or patience. We present a variation of the system we presented
        last year: we not only set the maximum number of documents that the
        physician is willing to read, but also distribute the effort across the
        topics proportionally to the number of documents in the pool. We compare
        the results of this approach with the “frozen” system we used in 2018
        and a BM25 baseline.

1     Introduction
In this paper, we describe the participation of the Information Management
Systems (IMS) group at CLEF eHealth 2019 [2] Technology Assisted Review
Task [1]. This task focuses on the problem of systematic reviews, that is, the
process of collecting articles that summarise all the evidence (if possible) that
has been published regarding a certain medical topic. This task requires long
search sessions by experts in the field of medicine; for this reason,
semi-automatic approaches are essential to support this type of search when the
amount of data exceeds the limits of users, i.e. in terms of attention or patience.
    The objective of our participation was to compare the system that we used
in the previous year with a new strategy that distributes the effort of the user
(the physician or an expert in the field of medicine) across the topics. In particular,
 – we re-use the stopping strategy to simulate the maximum amount of docu-
   ments that a physician is willing to review in the two-dimensional approach
   presented in [5];
 – we distribute the effort, in terms of number of documents to read, propor-
   tionally to the size of the pool of documents for each topic;
 – we estimate the 95% confidence interval of the proportion of relevant docu-
   ments present in the collection [6].

The source code of the experiments is available for reproducibility purposes.

2     Approach

In this paper, we continue to investigate the interaction with the two-dimensional
interpretation of the BM25 model applied to the problem of explicit relevance
feedback [9, 3, 8, 5, 7, 6].
    In particular, the two-dimensional representation of probabilities [4, 9] is an
intuitive way of presenting a two-class classification problem on a two-dimensional
space. Given two classes, for example relevant R and non-relevant NR, a
document d is assigned to category R if the following inequality holds:

\[
  \underbrace{P(d \mid NR)}_{y} \;<\; m \, \underbrace{P(d \mid R)}_{x} + q \tag{1}
\]

where P(d|R) and P(d|NR) are the likelihoods of the document d given the two
categories, while m and q are two parameters that can be optimized to compensate
for either unbalanced classes or different misclassification costs.
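
As an illustration of inequality (1), the following minimal Python sketch applies the decision rule to a pair of likelihoods. The function name and the default values m = 1 and q = 0 are assumptions made for this example only; the paper does not report the values used in the experiments.

```python
def assign_to_relevant(p_d_given_r: float, p_d_given_nr: float,
                       m: float = 1.0, q: float = 0.0) -> bool:
    """Return True when document d is assigned to category R.

    The document is a point (x, y) with x = P(d|R) and y = P(d|NR);
    it is assigned to R when it lies below the line y = m * x + q.
    The defaults m = 1, q = 0 are illustrative assumptions.
    """
    x = p_d_given_r
    y = p_d_given_nr
    return y < m * x + q


# Toy likelihoods (hypothetical values, not taken from the experiments).
print(assign_to_relevant(0.6, 0.3))          # point below the line y = x
print(assign_to_relevant(0.2, 0.5))          # point above the line y = x
print(assign_to_relevant(0.2, 0.5, m=3.0))   # a steeper line changes the decision
```

Increasing m (or q) enlarges the region assigned to R, which is one way to favour recall when relevant documents are rare.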
    We focused on the following problems:

 1. study the effectiveness of a classifier given a fixed amount of documents that
    a physician is willing to review;
 2. design a sampling strategy to estimate the 95% confidence interval of the
    number of relevant documents in the collection.
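
The second problem can be illustrated with a short sketch that computes a 95% confidence interval for the proportion of relevant documents from a random sample. The estimator used in [6] is not described in this paper; the Wilson score interval below is an assumption chosen for illustration, and the function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(n_relevant: int, n_sampled: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95%).

    n_relevant: relevant documents found among the randomly sampled ones.
    n_sampled:  number of randomly sampled documents that were judged.
    """
    if n_sampled <= 0:
        raise ValueError("at least one sampled document is required")
    p = n_relevant / n_sampled
    z2 = z * z
    denom = 1.0 + z2 / n_sampled
    center = (p + z2 / (2 * n_sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / n_sampled
                         + z2 / (4 * n_sampled ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical sample: 5 relevant documents among 50 sampled at random.
low, high = wilson_interval(5, 50)
print(f"proportion in [{low:.3f}, {high:.3f}]")
```

Multiplying both bounds by the size of the pool gives a rough interval for the number of relevant documents in that pool.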

In the experiments, we used the same procedure we used last year [6]:

 – we set a number n of documents that the physician is willing to read and
   a number s that tells the algorithm when (every s documents) to randomly
   sample a document from the collection instead of presenting to the physician
   the next most relevant document;
 – for each topic, we run an optimized (hyper-parameters) BM25 retrieval
   model and we obtain the relevance feedback for the first abstract in the
   ranking list;
 – from the second document until n/2−1, we continuously update the relevance
   weights of the terms according to the explicit relevance feedback given by
   the physician (simulated by the qrels available with the test collection);
 – for the last half of the documents n/2 that the physician is willing to read, we
   use a Naïve Bayes classifier continuously updated with the explicit relevance
   feedback [5].
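
The procedure above can be sketched as a simulation loop. This is a simplified illustration and not the released source code: the two rankers are passed in as hypothetical callables standing in for the BM25 relevance feedback model and the Naïve Bayes classifier, and the exact boundary between the two phases is an assumption.

```python
import random
from typing import Callable, Dict, List, Set, Tuple

# A ranker receives the unjudged documents and the feedback collected so far,
# and returns the unjudged documents ordered from most to least relevant.
Ranker = Callable[[List[str], Dict[str, bool]], List[str]]


def simulate_review(doc_ids: List[str], qrels: Set[str], n: int, s: int,
                    rank_feedback: Ranker, rank_classifier: Ranker,
                    seed: int = 0) -> Tuple[Dict[str, bool], List[str]]:
    """Simulate a physician who reads at most n documents.

    Every s-th document is drawn at random from the unjudged documents;
    otherwise the top-ranked unjudged document is shown. The first half of
    the budget uses rank_feedback, the second half rank_classifier.
    """
    rng = random.Random(seed)
    remaining = list(doc_ids)
    judged: Dict[str, bool] = {}
    sampled: List[str] = []
    for step in range(1, n + 1):
        if not remaining:
            break
        if s > 0 and step % s == 0:
            doc = rng.choice(remaining)          # random sample
            sampled.append(doc)
        else:
            ranker = rank_feedback if step <= n // 2 else rank_classifier
            doc = ranker(remaining, judged)[0]   # next most relevant document
        remaining.remove(doc)
        judged[doc] = doc in qrels               # feedback simulated by qrels
    return judged, sampled


# Toy example with a static ranker (hypothetical scores and qrels).
scores = {"d1": 0.9, "d2": 0.7, "d3": 0.4, "d4": 0.2, "d5": 0.1, "d6": 0.05}


def static_ranker(remaining: List[str], judged: Dict[str, bool]) -> List[str]:
    return sorted(remaining, key=lambda d: scores[d], reverse=True)


judged, sampled = simulate_review(list(scores), {"d2", "d5"}, n=4, s=3,
                                  rank_feedback=static_ranker,
                                  rank_classifier=static_ranker)
print(judged, sampled)
```

A real ranker would re-estimate its term weights from `judged` at every call; the static ranker is used here only to keep the example self-contained.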

                    Topic     Pool size  Proportion  Shown
                    CD000996        281       0.003     43
                    CD001261        571       0.007     86
                    CD004414        336       0.004     51
                    CD006468       3874       0.047    583
                    CD007867        943       0.011    142
                    CD008874       2382       0.029    359
                    CD009044       3169       0.038    477
                    CD009069       1757       0.021    265
                    CD009642       1922       0.023    290
                    CD010038       8867       0.108   1335
                    CD010239        224       0.003     34
                    CD010558       2815       0.034    424
                    CD010753       2539       0.031    382
                    CD011140        289       0.004     44
                    CD011558       2168       0.026    327
                    CD011571        146       0.002     22
                    CD011686       9729       0.118   1464
                    CD011768       9160       0.111   1379
                    CD011787       4369       0.053    658
                    CD011977        195       0.002     30
                    CD012069       3479       0.042    524
                    CD012080       6643       0.081   1000
                    CD012164         61       0.001     10
                    CD012233        472       0.006     72
                    CD012342       2353       0.029    355
                    CD012455       1593       0.019    240
                    CD012551        591       0.007     89
                    CD012567       6735       0.082   1014
                    CD012661       3367       0.041    507
                    CD012669       1260       0.015    190
                    CD012768        131       0.002     20
    Table 1: Pool size, proportion of the total pool, and number of documents
    shown to the user for each topic.

    Instead of setting the same n for all topics, this year we tried a different
approach that lets the user read more documents for those topics with more
documents in the pool. In Table 1, we show, for each topic, the number of
documents in the pool, the proportion of that pool with respect to the total
number of documents pooled, and the number of documents we will show to the
user (to be multiplied by 2).
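
The proportional allocation can be sketched as follows. The function name, the overall budget, and the choice of rounding up are assumptions made for illustration; the paper does not state the total budget or the rounding rule.

```python
from typing import Dict


def distribute_effort(pool_sizes: Dict[str, int], budget: int) -> Dict[str, int]:
    """Split a total reading budget across topics proportionally to pool size.

    Each topic receives ceil(budget * pool / total), capped at the pool size.
    Integer arithmetic is used so that the ceiling is exact.
    """
    total = sum(pool_sizes.values())
    if total <= 0:
        raise ValueError("the pools must contain at least one document")
    allocation = {}
    for topic, size in pool_sizes.items():
        share = -(-size * budget // total)   # ceiling division
        allocation[topic] = min(size, share)
    return allocation


# Toy pools (hypothetical topics and sizes, not those of Table 1).
pools = {"T1": 100, "T2": 300, "T3": 600}
print(distribute_effort(pools, budget=100))
```

Because every share is rounded up, the sum of the allocations can exceed the budget by at most one document per topic.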

3    Experiments

For all the experiments, we set the values of the BM25 hyper-parameters in the
same way we did in [6].
3.1   Official Runs

We submitted runs for three different systems:

 – a BM25 baseline with continuous active learning and a fixed threshold for
   each topic,
 – the “frozen” system of 2018 with different proportions of documents to be
   read for the initial phase but with a fixed threshold for each topic,
 – the new approach with a different threshold for each topic.

In particular, for the frozen system, we used 10% or 50% of the initial pool of
documents per topic to build the classifier. The new distributed effort approach
uses 10% of the pool at the beginning of the training, but, in general, it may
stop earlier compared to the other approach if the effort required for a topic is
low in terms of documents allowed.

3.2   Unofficial Runs

In order to compare the BM25 model with a similar proportion of documents
shown to the user, we added some BM25 runs and removed some others that
showed a different number of documents.

3.3   Evaluation Measures

In order to evaluate the performance of the systems, we chose the number of
documents shown to the user as one of the performance measures since, in our
case, it is also the point where we stop retrieving documents. In addition, we use
recall and averaged recall across topics to measure the accuracy of the retrieval.
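
The two recall measures can be computed as in the minimal sketch below. The function names and the toy data are assumptions for this example, and each topic is assumed to have at least one relevant document.

```python
from typing import Dict, Set


def recall(shown: Set[str], relevant: Set[str]) -> float:
    """Fraction of the relevant documents that were shown to the user."""
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    return len(shown & relevant) / len(relevant)


def average_recall(shown_by_topic: Dict[str, Set[str]],
                   qrels_by_topic: Dict[str, Set[str]]) -> float:
    """Mean of the per-topic recall values (every topic weighs the same)."""
    values = [recall(shown_by_topic.get(topic, set()), relevant)
              for topic, relevant in qrels_by_topic.items()]
    return sum(values) / len(values)


# Toy run over two hypothetical topics.
shown = {"T1": {"a", "b"}, "T2": {"x", "y", "z"}}
qrels = {"T1": {"a", "c"}, "T2": {"x", "y"}}
print(average_recall(shown, qrels))   # T1 recall 0.5, T2 recall 1.0
```

Averaging per topic means that a topic with a small pool influences the score as much as a topic with a large pool.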

3.4   Results

In Figures 1 and 2, we show a topic-by-topic comparison of groups of runs: BM25,
distributed effort, and the original 2018 system with 10% or 50% of the initial pool selected. By
increasing the threshold of the number of documents shown to the user, we are
able to tune the performance of the system and reach an average recall close to
100% for all the systems under evaluation. Some topics are much more difficult
than others; for example, topic CD011558 requires the retrieval of most of the
pooled documents in order to achieve a reasonable recall (around 0.8).
    In Figure 3, we show the performance of the four groups of runs in terms of
average recall (across topics) given the number of documents shown to the user.
By increasing the number of documents (from left to right), the four approaches
increase the average recall and go beyond 90% even with less than 4% of the
total number of documents; see, for example, the two 2018 approaches of the frozen system.
    The distributed effort approach we proposed this year performed worse than
expected. It seems that by reducing the number of documents allowed per topic
too much, especially for topics with smaller pools, we obtain a suboptimal system
compared to the original one. In other terms, it may be more convenient to set
up a fixed cost per topic and use all the documents of the pool if necessary,
instead of saving some resources for topics with more documents in the pool.

4    Conclusions

In this work, we presented a variation of the continuous active learning approach
used in [6] that uses a fixed stopping strategy to simulate the maximum amount
of documents that a physician is willing to review and a sampling strategy that is
used to estimate the number of relevant documents in the collection. The results
of the distributed effort approach were worse than expected, compared to the
original approach presented in 2018. The performance of the new system is
still remarkable since it achieves an average recall of 90% by using only 10% of
the documents in the collection; however, the original system can achieve the
same results by reducing the number of documents shown to the user by half.
     We are currently analyzing the results provided by the organizers and adding
to the official runs a set of unofficial runs that will complete the picture of all
the possible settings. As future work, we will study a methodology to dynamically
vary the number of documents shown according to the estimate of the number of
relevant documents still missing.


