      Beaver: Efficiently Building Test Collections for Novel Tasks
David Otero (david.otero.freijeiro@udc.es), Javier Parapar (javier.parapar@udc.es), and Álvaro Barreiro (barreiro@udc.es)
Information Retrieval Lab, University of A Coruña, A Coruña, Spain

ABSTRACT
Evaluation is a mandatory task in Information Retrieval research. Under the Cranfield paradigm, this evaluation needs test collections, and their creation is a time- and resource-consuming process. At the same time, new tasks and models are continuously appearing, and these tasks demand the building of new test collections. Typically, researchers organize TREC-like competitions to build these evaluation benchmarks, which is very expensive both for the organizers and for the participants. In this paper, we present a platform to easily and cheaply build datasets for Information Retrieval evaluation without the need to organize expensive campaigns. In particular, we propose the simulation of participant systems and the use of pooling strategies to make the most of the assessor's work. Our platform aims to cover the whole process of building a test collection, from document gathering to judgment creation.

KEYWORDS
Information Retrieval, Test Collections, Pooling

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1    INTRODUCTION
Information Retrieval (IR) research is deeply rooted in experimentation and evaluation [8]. Under the Cranfield paradigm, evaluation requires proper infrastructure: methodologies, metrics, and test collections. This paper focuses on building the latter. Collections are formed by documents, topics that describe the user information needs, and relevance judgments, which specify the documents that are relevant to those topics [10]. Typically, collections are the result of expensive evaluation campaigns such as the TREC tracks. In these forums, one of the most expensive activities is obtaining the relevance judgments, which requires much time and human effort. This is a handicap for teams that aim to build a new dataset for a specific domain, and for the many new tasks that need to provide training data before the challenge takes place [7]. In these cases, the construction of judgments cannot depend on the results of competition participants.
   In this paper, we present a platform that eases the construction of test collections without the need to organize evaluation campaigns, thus facilitating research in IR. We have joined in a single platform the processes of obtaining the source documents, producing the relevance judgments, and exporting the collection.

2    MOTIVATION
To illustrate our platform, we will use a novel task as an example: CLEF eRisk (https://erisk.irlab.org). This workshop is organized each year with the aim of evaluating methodologies for the early detection of risks on the Internet [5]. These risks are especially related to mental health issues such as self-harm, anorexia, and depression.
   In previous years, the eRisk organizers released test collections formed by texts written by users of Reddit (https://www.reddit.com). Those datasets were used by the competition participants to train their models, which were then evaluated on the test splits. In past editions, the ground truth for the datasets (training and test) was built by manually searching for relevant posts that talked about the corresponding topics, which were then judged by assessors, resulting in a prolonged and laborious process.
   Our platform proposes to ease the process of building the collection by simulating the results of participant systems and using pooling strategies that make the most of the assessor's work. We will guide this article through the process of building a test collection about self-harm. The reader can build their own test collection with this platform, which is live at https://beaver.irlab.org. Two user roles are defined in the platform: to create new experiments and export the collections, you can log in as admin@admin.com (pass: admin); to judge documents from an experiment, you can log in as assessor@assessor.com (pass: assessor).

3    SELF-HARM EXAMPLE
According to the World Health Organization (WHO), self-harm is 'an act with non-fatal outcome, in which an individual deliberately initiates a non-habitual behavior that, without intervention from others, will cause self-harm, or deliberately ingests a substance in excess of the prescribed or generally recognized therapeutic dosage, and which is aimed at realizing changes which the subject desired via the actual or expected physical consequences'. Within self-harm, there are various classifications according to the means a person uses to inflict harm on themselves (ICD-10 X71-X83, https://www.icd10data.com/ICD10CM/Codes/V00-Y99/X71-X83). We will use these types of self-harm to guide the document gathering process.
   The first step is to obtain the documents from a document source. In this example, our platform uses the Reddit API to download the texts published by users of this social network. Figure 1 shows the architecture and main components of the system. The flexibility of our architecture will allow us to introduce further document sources in the future. To obtain the documents, the user may specify different query variants to be used for retrieving posts from Reddit. In this example, it makes sense to use several query variants related to the different classes of self-harm explained above; 'drown myself', 'cut myself', 'punch myself', 'shot myself', and 'burn myself' are good examples. The system will use those queries to download the whole history of every user with posts matching them (in the case of eRisk, the retrieval unit is a user history).
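The paper does not name a specific client for the Reddit API; the sketch below is a minimal illustration that assumes the Python PRAW wrapper. It shows how the gathering step described above could work: each query variant retrieves matching submissions, and the full posting history of each matching author is then downloaded. Credentials, limits, and function names are placeholders, not the platform's actual code.

    # Document gathering sketch: query variants -> matching authors -> full user histories.
    # Assumes the PRAW wrapper for the Reddit API; credentials are placeholders.
    import praw

    QUERY_VARIANTS = ["drown myself", "cut myself", "punch myself",
                      "shot myself", "burn myself"]

    reddit = praw.Reddit(client_id="...", client_secret="...",
                         user_agent="beaver-collection-builder")

    def gather_user_histories(limit_per_query=100):
        """Return {username: [post texts]} for authors with posts matching any variant."""
        histories = {}
        for query in QUERY_VARIANTS:
            for submission in reddit.subreddit("all").search(query, limit=limit_per_query):
                author = submission.author
                if author is None or author.name in histories:
                    continue  # deleted account, or history already downloaded
                # The retrieval unit is a user history: download all of the user's posts.
                histories[author.name] = [s.title + "\n" + s.selftext
                                          for s in author.submissions.new(limit=None)]
        return histories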


Figure 1: System architecture.

   After downloading the documents, our platform creates different document rankings simulating participant systems. The rankings are created by combining the introduced query variants with different retrieval models, such as BM25, Language Models, TF-IDF, or others implemented in our platform. In our example, we simulate 20 runs by combining the five self-harm query variants above with four ranking models.
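The paper does not detail how the simulated runs are produced internally; the following is a minimal sketch of the idea, assuming that each (query variant, retrieval model) pair yields one simulated run, so that five variants and four models give the 20 runs of the example. The two scorers shown, raw term frequency and a simple TF-IDF, are simplified stand-ins for the BM25, Language Model, and TF-IDF rankers mentioned above.

    # Run simulation sketch: one ranking per (query variant, retrieval model) pair.
    import math
    from collections import Counter

    def tf_score(query_terms, doc_terms, doc_freqs, n_docs):
        counts = Counter(doc_terms)
        return sum(counts[t] for t in query_terms)

    def tfidf_score(query_terms, doc_terms, doc_freqs, n_docs):
        counts = Counter(doc_terms)
        return sum(counts[t] * math.log((n_docs + 1) / (doc_freqs.get(t, 0) + 1))
                   for t in query_terms)

    MODELS = {"tf": tf_score, "tfidf": tfidf_score}  # stand-ins for BM25, LM, TF-IDF, ...

    def simulate_runs(docs, query_variants, models=MODELS):
        """docs maps doc_id -> list of terms; returns {(query, model): ranked doc ids}."""
        n_docs = len(docs)
        doc_freqs = Counter(t for terms in docs.values() for t in set(terms))
        runs = {}
        for query in query_variants:
            q_terms = query.lower().split()
            for model_name, scorer in models.items():
                scores = {doc_id: scorer(q_terms, terms, doc_freqs, n_docs)
                          for doc_id, terms in docs.items()}
                runs[(query, model_name)] = sorted(scores, key=scores.get, reverse=True)
        return runs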
   Then the admin selects the pooling strategy to apply over the simulated participants' results. This choice decides the order in which documents are presented to the assessor. Currently, two pooling methods are implemented: traditional DocID [10] and MoveToFront (MTF) [3], although more strategies are being implemented, such as Hedge [2] and Bayesian Bandits [6]. The TREC 2017 Common Core Track used the latter for creating its judgments [1], the first time that TREC replaced the traditional DocID ordering. These strategies aim to reduce the assessor's time and effort in the creation of the relevance judgments without harming their quality. In particular, the Max Mean method from [6] was recently shown to be the best one in terms of bias [4]. However, the reusability of judgments constructed with these approaches is still an open research issue [9] that we hope this platform will help to investigate.
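To make the adjudication idea concrete, below is a minimal sketch of MoveToFront pooling in the spirit of [3], assuming that relevance feedback comes from a judge callback standing in for the human assessor: documents keep being drawn from the run that most recently contributed a relevant document, and a run is demoted as soon as it yields a non-relevant one. This is an illustration under those assumptions, not the platform's implementation.

    # MoveToFront (MTF) pooling sketch: prioritize runs whose recent documents were relevant.
    from collections import deque

    def mtf_pool(runs, judge, budget):
        """runs: list of ranked doc-id lists; judge: doc_id -> bool; budget: max judgments."""
        queue = deque(range(len(runs)))   # run indices; the front run supplies the next document
        positions = [0] * len(runs)       # next unjudged rank within each run
        qrels = {}                        # doc_id -> 0/1 relevance
        while queue and len(qrels) < budget:
            run = queue[0]
            ranking = runs[run]
            # Skip documents already judged through other runs.
            while positions[run] < len(ranking) and ranking[positions[run]] in qrels:
                positions[run] += 1
            if positions[run] >= len(ranking):
                queue.popleft()           # run exhausted
                continue
            doc = ranking[positions[run]]
            positions[run] += 1
            relevant = judge(doc)
            qrels[doc] = int(relevant)
            if not relevant:
                queue.rotate(-1)          # demote this run; move on to the next one
        return qrels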
   When the pooling phase starts, the assessor may begin to judge the relevance of the Reddit users. To this aim, the assessor sees all the posts written by each user, divided into several pages. Two buttons for judging the relevance of the user are available on every page, because the assessor may not need to see all the posts to decide whether a user is relevant or not. For each post, the assessor is presented with its content, its publication date, and a link to the original post on Reddit. The platform does not show any additional data about the user (apart from the user id) to avoid introducing any bias into the assessor's decision. Additionally, the assessor has the option to specify a query to search through the user's publication history to speed up the judging process. Finally, when the assessor completes the work, the administrator can export the test collection along with the judgments.
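The paper does not specify the export format. Since the goal is a TREC-like test collection, one natural target for the judgments is the standard TREC qrels layout (topic, iteration, document id, relevance); the snippet below is an illustrative sketch of such an export, with the file name and topic id chosen only for the example.

    # Illustrative export of the judgments in TREC qrels format:
    # one line per judged document: <topic> <iteration> <doc_id> <relevance>.
    def export_qrels(judgments, topic_id="self-harm", path="beaver.qrels"):
        """judgments: {doc_id: 0/1 relevance} collected during assessment."""
        with open(path, "w") as f:
            for doc_id, rel in sorted(judgments.items()):
                f.write(f"{topic_id} 0 {doc_id} {rel}\n")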

4    CONCLUSIONS AND FUTURE WORK
Test collections are vital for IR evaluation, but obtaining the relevance judgments is an expensive task. In this article, we have presented a platform to easily and cheaply build test collections, lessening the need to organize an evaluation campaign. The use of intelligent pooling strategies that heavily reduce the assessor's work makes this process cheaper. The system is well suited to research teams that want to build a collection within a specific domain, because they do not need to organize a competition beforehand to obtain the runs of participant systems. We plan to use the system as a testbed for evaluating the effect of pooling strategies on the quality of the resulting datasets.

ACKNOWLEDGMENTS
This work was supported by project RTI2018-093336-B-C22 (MCIU & ERDF), project GPC ED431B 2019/03 (Xunta de Galicia & ERDF), and accreditation ED431G 2019/01 (Xunta de Galicia & ERDF).

REFERENCES
[1] James Allan, Donna Harman, Evangelos Kanoulas, Dan Li, Christophe Van Gysel, and Ellen M. Voorhees. 2017. TREC 2017 Common Core Track Overview. In Proceedings of The Twenty-Sixth Text REtrieval Conference, TREC 2017, Gaithersburg, Maryland, USA, November 15-17, 2017, Ellen M. Voorhees and Angela Ellis (Eds.). NIST Special Publication, National Institute of Standards and Technology (NIST).
[2] Javed A. Aslam, Virgiliu Pavlu, and Robert Savell. 2003. A Unified Model for Metasearch, Pooling, and System Evaluation. In Proceedings of the Twelfth International Conference on Information and Knowledge Management (CIKM '03). ACM, New York, NY, USA, 484–491.
[3] Gordon V. Cormack, Christopher R. Palmer, and Charles L. A. Clarke. 1998. Efficient Construction of Large Test Collections. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98). ACM, New York, NY, USA, 282–289.
[4] Aldo Lipani, David E. Losada, Guido Zuccon, and Mihai Lupu. 2019. Fixed-Cost Pooling Strategies. To appear in IEEE Transactions on Knowledge and Data Engineering (TKDE).
[5] David E. Losada, Fabio Crestani, and Javier Parapar. 2019. Overview of eRisk 2019: Early Risk Prediction on the Internet. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, Fabio Crestani, Martin Braschler, Jacques Savoy, Andreas Rauber, Henning Müller, David E. Losada, Gundula Heinatz Bürki, Linda Cappellato, and Nicola Ferro (Eds.). Springer International Publishing, Cham, 340–357.
[6] David E. Losada, Javier Parapar, and Álvaro Barreiro. 2017. Multi-armed bandits for adjudicating documents in pooling-based evaluation of information retrieval systems. Information Processing and Management (2017).
[7] Thomas Mandl, Sandip Modha, Prasenjit Majumder, Daksh Patel, Mohana Dave, Chintak Mandlia, and Aditya Patel. 2019. Overview of the HASOC Track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages. In Proceedings of the 11th Forum for Information Retrieval Evaluation (FIRE '19). Association for Computing Machinery, New York, NY, USA, 14–17.
[8] Ellen M. Voorhees. 2002. The Philosophy of Information Retrieval Evaluation. In Revised Papers from the Second Workshop of the Cross-Language Evaluation Forum on Evaluation of Cross-Language Information Retrieval Systems (CLEF '01). Springer-Verlag, London, UK, 355–370.
[9] Ellen M. Voorhees. 2018. On Building Fair and Reusable Test Collections Using Bandit Techniques. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18). ACM, New York, NY, USA, 407–416.
[10] Ellen M. Voorhees and Donna K. Harman. 2005. TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing). The MIT Press.