Document-to-document relevance assessment for TREC Genomics Track 2005

Olga Giraldo 1, María Fernanda Cadena 1,2,3, Andrea Robayo-Gama 1,2,3, Dhwani Solanki 1,4, Tim Fellerhoff 1,5, Lukas Geist 1,6, Rohitha Ravinder 1,4, Muhammad Talha 1,6, Dietrich Rebholz-Schuhmann 1,7 and Leyla Jael Castro 1

1 ZB MED Information Centre for Life Sciences, Gleueler Str. 60, Cologne, 50931, Germany
2 Institute of Molecular Medicine and Cell Research, University of Freiburg, Stefan-Meier-Str. 17, Freiburg im Breisgau, 79104, Germany
3 Facultad de Farmacia y Bioquímica, Universidad de Buenos Aires, Junín 956, Buenos Aires, C1113AAD, Argentina
4 Bonn-Aachen International Centre for Information Technology (B-IT), University of Bonn, Friedrich-Hirzebruch-Allee 6, Bonn, 53115, Germany
5 Heinrich-Heine University Düsseldorf, Universitätsstraße 1, Düsseldorf, 40225, Germany
6 Hochschule Bonn-Rhein-Sieg, Grantham-Allee 20, Sankt Augustin, 53757, Germany
7 University of Cologne, Albertus-Magnus-Platz, Cologne, 50923, Germany

Abstract
Here we present a doc-2-doc relevance assessment performed on a subset of the TREC Genomics Track 2005 collection. Our approach comprises an experimental setup to manually assess doc-2-doc relevance and the corresponding analysis of the results obtained from this experiment. The experiment takes one document as a reference and assesses a second document regarding its relevance to the reference one. The consistency of the assessments made by four domain experts was evaluated. The lack of agreement between annotators may be due to: i) abstracts lacking key information and/or ii) annotators lacking experience in the evaluation of some topics.

Keywords
Relevance assessment, manual literature curation, document similarity

1. Introduction

The TREC Genomics Track 2005 [1] provides a collection of document-to-topic relevance assessments for Medline abstracts. This collection has also commonly been used for document-to-document tasks [2,3,4]; however, to the best of our knowledge, no analysis has been done regarding the suitability of this collection for such tasks, e.g., similarity or relevance assessment between a pair of documents. Our doc-2-doc relevance analysis aims at filling this gap. We take one document as a "reference article" while a second one is evaluated with respect to its relevance to the reference document. In this experiment, the user is engaged in the evaluation process as a way to achieve better results.

2. Methodology

Selection of topics and documents. The TREC 2005 Genomics track included a total of 50 topics. For the document-to-document relevance assessment we wanted to have 15 documents to assess per reference document, at least 10 documents judged as definitely relevant with respect to the TREC topic, and no more than 80 relevant articles per topic (either definitely or partially relevant). The goal was to have a sample covering about 10% of the relevant articles for the document-to-document assessment task. A minimal sketch of this topic filtering is given below.
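As an illustration of these selection criteria, the sketch below filters topics using the relevance codes of the TREC 2005 Genomics judgements (2 = definitely relevant, 1 = partially relevant, 0 = not relevant). It is a minimal sketch under assumed inputs, not the script we used; the data layout and all names are chosen for illustration only.

```python
# Assumed layout: judgements_by_topic maps a TREC 2005 Genomics topic id to a
# list of relevance judgements (2 = definitely, 1 = partially, 0 = not relevant).

def select_topics(judgements_by_topic,
                  min_definitely_relevant=10,
                  max_relevant=80):
    """Keep topics with at least 10 definitely relevant documents and
    no more than 80 relevant documents (definitely or partially)."""
    selected = []
    for topic_id, judgements in judgements_by_topic.items():
        definitely = sum(1 for j in judgements if j == 2)
        relevant = sum(1 for j in judgements if j >= 1)
        if definitely >= min_definitely_relevant and relevant <= max_relevant:
            selected.append(topic_id)
    return selected

# Toy judgements for two hypothetical topics:
example = {
    "100": [2] * 12 + [1] * 20 + [0] * 30,  # 12 definitely, 32 relevant -> kept
    "101": [2] * 5 + [1] * 90,              # too few definitely, too many relevant -> dropped
}
print(select_topics(example))  # ['100']
```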
This filtering gave us a total of 16 topics, which were further reduced to 8 topics due to time constraints and the expertise of the annotators on the different TREC topics. These topics contained a total of 42 reference documents and 630 documents to be assessed.

Development of an in-house annotation tool and a corpus of documents. The tool [5] presents a corpus of documents organized by topics. The corpus is based on the TREC 2005 Genomics track, with some pre-processing to obtain the reference documents and the documents to be assessed against them. The annotation workflow is shown in Figure 1.

Figure 1: Assessment steps in the relevance assessment tool

Training sessions. Virtual sessions were organized with the participants to train them in the use of the tool and to resolve open questions. The meetings were carried out via Zoom. The participants were four domain experts with expertise in life sciences and/or bioinformatics.

Relevance assessment by domain experts. In the tool, the documents were organized by topics, where each topic includes a number of reference articles. Each reference article comes with 15 documents (evaluation articles) to be assessed; only the title and abstract are available. The possible relevance assessment values are as follows:
i) Relevant to the reference article, meaning "Yes, the user wants to get a hold of the full text as it is definitely relevant to their research".
ii) Partially relevant to the reference article, meaning "Looks promising but not sure yet. The user will keep the PMID just in case, as a maybe".
iii) Non-relevant to the reference article, meaning "Not worth giving it a second look at all".

Analysis of results. The consistency of the assessments done by the four domain experts was evaluated, focusing on inter-annotator agreement.

3. Results

Documents assessed. A total of 630 evaluation articles, classified into 8 topics, were assessed by four annotators. The evaluation articles are distributed across 42 reference articles (15 documents per reference article). The full data is available online [6].

Inter-annotator agreement. Annotators rated the documents into three categories (2 definitely relevant, 1 partially relevant, 0 non-relevant). All four annotators rated 35 of the 630 assessed documents (5.56%) as "definitely relevant". Similarly, the four annotators agreed on a "partially relevant" rating for 6 documents (0.95%) and on a "non-relevant" rating for 123 documents (19.52%), giving a total agreement among the four annotators for 164 articles (26.03%). The Fleiss' kappa results fall into three levels of agreement: "poor", with values from -0.1708 to 0.1885; "fair", with values from 0.2214 to 0.375; and "moderate", with values from 0.4564 to 0.5328. From the 42 reference articles, 24 obtained a Fleiss' kappa corresponding to "poor", 14 corresponding to "fair", and 5 corresponding to "moderate". A table summarizing the results about the inter-annotator agreement is available online [7]. Per-category agreement counts are shown in Table 1, and a minimal sketch of the kappa computation is given after the table.

Table 1
Agreement across annotators

Relevance category     Full agreement among four annotators   Agreement among three annotators
Definitely relevant    35                                      72
Partially relevant     6                                       40
Non-relevant           123                                     109
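The per-reference-article agreement can, in principle, be reproduced with a plain implementation of the standard Fleiss' kappa formula over the 15 x 4 rating matrix of a reference article. The sketch below is not the script behind the published numbers, and the example ratings are invented.

```python
from collections import Counter

def fleiss_kappa(ratings, categories=(0, 1, 2)):
    """Fleiss' kappa for documents rated by the same number of annotators.

    `ratings` has one entry per document; each entry lists the ratings given
    by the annotators, e.g. [2, 2, 1, 2] for four annotators.
    """
    n_docs = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(r) for r in ratings]  # n_ij: raters choosing category j for document i

    # Observed agreement: mean per-document agreement P_i
    p_i = [(sum(c[cat] ** 2 for cat in categories) - n_raters)
           / (n_raters * (n_raters - 1)) for c in counts]
    p_bar = sum(p_i) / n_docs

    # Expected agreement P_e from the marginal category proportions
    p_j = [sum(c[cat] for c in counts) / (n_docs * n_raters) for cat in categories]
    p_e = sum(p ** 2 for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Invented ratings for one reference article: 15 documents x 4 annotators
ratings = [
    [2, 2, 2, 2], [2, 2, 1, 2], [1, 1, 0, 1], [0, 0, 0, 0], [0, 1, 0, 0],
    [2, 1, 1, 2], [0, 0, 0, 0], [0, 0, 1, 0], [2, 2, 2, 1], [0, 0, 0, 0],
    [1, 2, 2, 2], [0, 0, 0, 1], [0, 0, 0, 0], [2, 2, 2, 2], [1, 1, 2, 1],
]
print("Fleiss' kappa:", round(fleiss_kappa(ratings), 4))

# Documents on which all four annotators give the same rating (cf. Table 1)
print("Full agreement:", sum(1 for r in ratings if len(set(r)) == 1))
```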
4. Discussion and conclusions

This work presents the analysis of the results obtained in an experiment focused on evaluating the relevance between two articles. The experiment takes one document as a reference and assesses a second document regarding its relevance to the reference one. The methodological aspects involved the participation of four domain experts, who used an in-house annotation tool tailored to the initial TREC data and the task at hand.

The lack of agreement between annotators may be due to:
i) The abstract lacks key information, for example the objective, main results, or conclusions. In this case, the reader has to search the entire document for the missing information.
ii) The annotators lack experience in the evaluation of some topics. In this case, the reader must search the web for more information on a topic to better understand the document to be evaluated.
Both situations are time consuming. To overcome these limitations and improve the results, we propose as future work to extend the time allocated to the evaluation tasks and/or to increase the number of annotators to compensate for the lack of expertise in some topics.

5. Acknowledgements

This work is part of the STELLA project funded by DFG (project no. 407518790). This work was supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A).

6. References

[1] Hersh W, Cohen A, Yang J, Bhupatiraju RT, Roberts P, Hearst M. TREC 2005 Genomics Track Overview. In: Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005); 2005.
[2] Lin J, Wilbur WJ. PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics. 2007;8:423. doi:10.1186/1471-2105-8-423
[3] Garcia Castro LJ, Berlanga R, Garcia A. In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central Open Access. Journal of Biomedical Informatics. 2015;57:204-218. doi:10.1016/j.jbi.2015.07.015
[4] Wei W, Marmor R, Singh S, Wang S, Demner-Fushman D, Kuo TT, Hsu CN, Ohno-Machado L. Finding Related Publications: Extending the Set of Terms Used to Assess Article Similarity. AMIA Jt Summits Transl Sci Proc. 2016;2016:225-234.
[5] Talha M, Geist L, Fellerhoff T, Ravinder R, Giraldo O, Rebholz-Schuhmann D, et al. TREC-doc-2-doc-relevance assessment interface. Zenodo; 2022. doi:10.5281/zenodo.7341391
[6] Giraldo O, Solanki D, Cadena F, Robayo-Gama A, Rebholz-Schuhmann D, Castro LJ. Document-to-document relevant assessment for TREC Genomics Track 2005. Zenodo; 2022. doi:10.5281/zenodo.7324822
[7] Giraldo O, Solanki D, Rebholz-Schuhmann D, Castro LJ. Fleiss kappa for doc-2-doc relevance assessment. Zenodo; 2022. doi:10.5281/zenodo.7338056