<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Exploring Large Language Models for Relevance Judgments in Tetun</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriel de Jesus</string-name>
          <email>gabriel.jesus@inesctec.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sérgio Nunes</string-name>
          <email>sergio.nunes@fe.up.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Washington DC, United States.</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FEUP - Faculty of Engineering, University of Porto</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>INESC TEC - Institute for Systems and Computer Engineering, Technology and Science</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <fpage>4</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The Cranfield paradigm has served as a foundational approach for developing test collections, with relevance judgments typically conducted by human assessors. However, the emergence of large language models (LLMs) has introduced new possibilities for automating these tasks. This paper explores the feasibility of using LLMs to automate relevance assessments, particularly within the context of low-resource languages. In our study, LLMs are employed to automate relevance judgment tasks by providing a series of query-document pairs in Tetun as the input text. The models are tasked with assigning relevance scores to each pair, and these scores are then compared to those from human annotators to evaluate inter-annotator agreement levels. Our investigation reveals results that align closely with those reported in studies of high-resource languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Large language models</kwd>
        <kwd>Relevance judgments</kwd>
        <kwd>Low-resource languages</kwd>
        <kwd>Tetun</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The advancement of information retrieval (IR) systems depends on the availability of reliable
test collections to assess their effectiveness. The traditional approach for developing these
collections follows the Cranfield paradigm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which became widely recognized through the
Text REtrieval Conference (TREC) series of large-scale evaluation campaigns [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In TREC
guidelines, a test collection comprises a document collection, a set of topics, and corresponding
relevance assessments. The relevance judgment tasks are typically carried out by human
assessors, a process that is both time-consuming and costly.
      </p>
      <p>To tackle the aforementioned problems, the IR community has been investigating the
feasibility of automatically generated relevance judgments for developing test collections. With the
advent of large language models (LLMs), which have demonstrated proficiency in various tasks,
new possibilities for conducting automated relevance judgments have emerged, with the quality
of automated relevance judgment tasks improving as LLMs continue to evolve.</p>
      <p>
        Studies have consistently shown that LLMs are effective in automated relevance assessment
tasks, providing cost-effective solutions with judgment agreement comparable to
human assessors. Faggioli et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] argued that although further improvement in LLM capabilities
is necessary for fully automated relevance judgments, LLMs are already capable of assisting
humans in this task. Additionally, a recent study by Bueno et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] reported a consistent
improvement in automated relevance judgments, with an average Cohen’s kappa score of 0.31
for annotation agreement between humans and LLMs, which is in line with the findings of
Faggioli et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, these studies primarily focus on high-resource languages, such as
English and Brazilian Portuguese, leaving the applicability of LLMs in low-resource language
(LRL) contexts as an open question.
      </p>
      <p>
        In this study, we explore the use of LLMs to automate relevance judgment tasks in Tetun,
an LRL spoken by over 923,000 people in Timor-Leste [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We used an existing test collection
comprising 6,100 relevance judgments, constructed using documents from the
Labadain-30k+ dataset [6]. The relevance judgments for this collection were conducted by native Tetun
speakers. The query-document pairs were provided to the LLMs, which assigned a relevance score
to each. We compared these scores with those from the human annotations and measured
inter-annotator agreement levels. The results revealed an inter-annotator agreement, in terms of
Cohen’s kappa, of 0.2634 when evaluated using the 70B variant of the LLaMA3 model [7]. This
finding demonstrates the feasibility of using LLMs in LRL scenarios to automate relevance
judgment tasks.
      </p>
      <p>The remaining sections of this paper are organized as follows. Section 2 describes related
work. An overview of the collection used in this study is outlined in Section 3. Then, Section 4
details the experiment of using LLMs to automate relevance judgments. Section 5 presents
and discusses the results obtained. Finally, Section 6 summarizes our conclusions and
possible future work.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>Test collections are the most important component used for evaluating the effectiveness of IR
systems. For high-resource languages, these collections are typically made available through
large-scale campaigns such as the Text REtrieval Conference (TREC, https://trec.nist.gov), the Conference and Labs
of the Evaluation Forum (CLEF, https://www.clef-initiative.eu), the NII Testbeds and Community for Information Access
Research project (NTCIR, http://research.nii.ac.jp/ntcir/index-en.html), and the Forum for Information Retrieval Evaluation (FIRE, http://fire.irsi.res.in/).</p>
      <p>
        The TREC-style approach, derived from the Cranfield paradigm, is commonly adopted for
developing test collections, including for low-resource languages (LRLs), where human assessors
conduct the relevance judgment tasks [8, 9, 10, 11]. However, the fast pace of research and
innovation, particularly with the emergence of LLMs, has significantly transformed natural
language processing (NLP). Within the IR domain, studies have demonstrated that automated
relevance judgments using LLMs can yield results comparable to traditional methods, and
these outcomes have consistently improved as LLMs have evolved. Initially, Faggioli et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
explored the potential application of LLMs to fully automated relevance judgment tasks. They
analyzed the judgment results from the TREC 2021 Deep Learning track [12] and compared them
with LLM-based relevance assessments generated using OpenAI&#8217;s GPT-3.5 (https://openai.com). Their findings
revealed a Cohen&#8217;s kappa score of 0.26 for inter-annotator agreement between human and
LLM assessors, indicating a fair level of agreement. They therefore argued that LLMs are already capable of
assisting humans in relevance judgment tasks, even though further improvements in LLM capabilities
are necessary for fully automated relevance judgments.
      </p>
      <p>
        Later, Thomas et al. [13] reported that LLMs demonstrated accuracy comparable to human
labelers when deployed for large-scale relevance labeling at Bing. Their work utilized the GPT-4
model [14] and incorporated data from the TREC Robust04 track [15], showing that LLMs
achieved Cohen’s kappa scores ranging from 0.20 to 0.64 for agreement between humans and
LLMs across various tasks. In a recent study, Bueno et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], while constructing a
test collection for Brazilian Portuguese, reported consistent improvement and findings
comparable to those of Thomas et al. [13] and Faggioli et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], with automated relevance judgments
yielding an average Cohen’s kappa score of 0.31 for annotation agreement between humans
and LLMs.
      </p>
      <p>Despite these advancements, uncertainties persist about the feasibility of using LLMs to
automatically generate relevance judgments for LRLs. Our research therefore explores
this potential application in LRL scenarios, specifically in Tetun.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Collection Overview</title>
      <p>In this experiment, we utilized an existing Tetun test collection (not yet published at the time of writing) developed according to TREC
guidelines. The following subsections detail this test collection.</p>
      <sec id="sec-4-1">
        <title>3.1. Documents</title>
        <p>The documents of the Tetun test collection are derived from the Labadain-30k+ dataset, which
consists of 33,550 documents in Tetun [6]. This dataset was acquired from the web and
encompasses a broad array of categories, including news articles, Wikipedia entries, legal and
government documents, research papers, and more [16]. A summary of the document collection
is provided in Table 1.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Queries</title>
        <p>The collection consists of 61 queries developed by five volunteer students, all Timorese and
native Tetun speakers. The queries originate from the logs of Timor News (https://www.timornews.tl), an online
newspaper based in Dili, Timor-Leste. Statistics about the queries are presented in Table 2.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Relevance Judgments</title>
        <p>
          Relevance judgments were conducted by the same five Timorese students, who were
tasked with evaluating the relevance of query-document pairs. The pairs were classified into
four graded levels of topical relevance: irrelevant, marginally relevant, relevant, and highly
relevant, as proposed by Sormunen [
          <xref ref-type="bibr" rid="ref6">17</xref>
          ]. The inter-annotator agreement achieved an average
Cohen’s kappa score of 0.4236, and the details of the resulting test collection are presented in
Table 3.
        </p>
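        <p>For illustration, an average agreement of this kind can be computed as the mean of Cohen’s kappa over all annotator pairs. The following is a minimal Python sketch of that computation; the label lists are hypothetical placeholders rather than the collection’s actual judgments, and scikit-learn’s cohen_kappa_score is assumed as the kappa implementation.</p>
        <preformat>
# Minimal sketch: mean pairwise Cohen's kappa across several annotators.
# The label lists below are hypothetical placeholders for graded labels (0-3),
# aligned by query-document pair; they are not the collection's real judgments.
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

annotations = [
    [0, 1, 3, 2, 1, 0],  # annotator A
    [0, 1, 2, 2, 1, 0],  # annotator B
    [1, 1, 3, 2, 0, 0],  # annotator C
]

pairwise = [cohen_kappa_score(a, b) for a, b in combinations(annotations, 2)]
print(f"mean pairwise kappa: {sum(pairwise) / len(pairwise):.4f}")
        </preformat>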
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Relevance Judgments Using LLMs</title>
      <sec id="sec-5-1">
        <title>4.1. Overview</title>
        <p>
          Several studies have already utilized the GPT-3.5 and GPT-4 models from OpenAI to automate
relevance judgment tasks [
          <xref ref-type="bibr" rid="ref3 ref4">3, 13, 4</xref>
          ]. However, due to the costs associated with these LLMs, our
study explores an alternative by employing the freely available 70B variant of LLaMA3, released
by Meta on April 18, 2024 [
          <xref ref-type="bibr" rid="ref7">18</xref>
          ]. We conduct automated relevance judgments using the Tetun
test collection detailed in Section 3 and compare the resulting inter-annotator agreement levels.
        </p>
        <p>Additionally, to evaluate whether the free 70B variant of the LLaMA3 model can outperform
certain paid LLMs in relevance assessment tasks, specifically within the Tetun context, we
selected two paid models for comparison: the Haiku variant of Claude 3 from Anthropic, and
the Turbo variant of GPT-3.5 from OpenAI. A summary of the models used, along with their
associated costs, is presented in Table 4.</p>
        <p>To assess the suitability of the chosen LLMs for Tetun, including the two paid models, we
conducted preliminary tests that involved translating Tetun text into English. This step was
essential given that the query-document pairs are written in Tetun. Examples of these translated
outputs are presented in Table 5, showing that LLaMA3 inaccurately translated two words, as
indicated by strike-through markings.</p>
        <p>
          To evaluate the quality of the translated text generated by the LLMs, we randomly selected a
sample of five documents from the query-document pairs (see the example in Table 8) and translated
them into English ourselves. These human translations served as reference points for evaluation. The
assessment using the BLEU metric [
          <xref ref-type="bibr" rid="ref8">19</xref>
          ] demonstrates that both paid models outperformed
LLaMA3 in translating Tetun to English, as shown in Table 6.
        </p>
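        <p>As a concrete illustration of this evaluation step, BLEU against a single human reference translation can be computed as in the following sketch, which assumes the sacreBLEU library as the BLEU implementation; the sentences are placeholders rather than the actual sample documents.</p>
        <preformat>
# Minimal sketch of the BLEU check: LLM translations scored against human
# reference translations. The texts are placeholders, not the real samples.
import sacrebleu

llm_translations = [
    "UNFPA will cooperate with the Ministry of Health on HIV/AIDS prevention.",
]
human_references = [  # one reference stream, parallel to the hypotheses
    ["UNFPA will collaborate with the Ministry of Health to prevent HIV/AIDS."],
]

bleu = sacrebleu.corpus_bleu(llm_translations, human_references)
print(f"BLEU: {bleu.score:.2f}")
        </preformat>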
        <p>
          However, given that relevance judgment tasks require not only direct translation but also
a nuanced level of understanding, we compared the selected models’ multi-task language
understanding capabilities using the Massive Multitask Language Understanding (MMLU) benchmark [
          <xref ref-type="bibr" rid="ref9">20</xref>
          ],
based on the MMLU benchmark leaderboard [
          <xref ref-type="bibr" rid="ref10">21</xref>
          ]. A summary of these LLMs’ performance on
MMLU is outlined in Table 7. It shows that in the few-shot scenario with five examples, LLaMA3
surpassed Claude 3 Haiku by an average of +5 percentage points and GPT-3.5 Turbo by +10.2
percentage points.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Experiment with Tetun</title>
        <p>
To automate relevance judgments using LLMs, we utilized few-shot prompting, adopting a
structure similar to that employed by Bueno et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Our prompt, along with an example, is
illustrated in Prompt 4.1, and the full prompt is outlined in Appendix A. We provided the LLMs
with a total of 6,100 query-document pairs and tasked them with assigning a relevance
score to each. Examples of these query-document pairs are depicted in Table 8.
        </p>
        <p>Given that the existing Tetun test collection employs four-level relevance scores ranging
from 0 to 3, we provided the LLMs with query-document pairs alongside four examples, one
for each relevance score. These examples used the same queries as those utilized in the pilot
testing phase by human assessors, including the relevance score and the reasoning behind each
score. For each request, we asked the LLMs to assign one of the four scores and provide the
reasoning for their assigned score.</p>
        <p>For the 70B variant of the LLaMA3 model, which requires a substantial amount of memory
to run locally, specifically a minimum of 40 GB of RAM as indicated by Ollama, we utilized the
free API version of the cloud infrastructure provided by Groq (https://console.groq.com/settings/billing) to execute this model. However,
the scripts for automated relevance judgments for all models were executed locally.</p>
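        <p>For concreteness, one judgment request could look like the following sketch, which assumes Groq’s OpenAI-compatible chat completions endpoint; the model identifier, environment variable, and prompt handling are illustrative, not the exact script used in this work.</p>
        <preformat>
# Minimal sketch of a single relevance-judgment request, assuming Groq's
# OpenAI-compatible chat completions API. Model id and prompt are illustrative.
import os

import requests

API_URL = "https://api.groq.com/openai/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}

def judge(system_prompt: str, query: str, document: str) -> str:
    """Send one query-document pair and return the raw model response text."""
    payload = {
        "model": "llama3-70b-8192",  # illustrative id for the LLaMA3 70B model
        "temperature": 0.0,          # zero temperature, as in the experiment
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"query: {query}\ndocument: {document}"},
        ],
    }
    response = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
        </preformat>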
        <sec id="sec-5-2-1">
          <title>Prompt 4.1: Example of the System Prompt.</title>
          <p>You are an expert assessor and you are tasked with assessing the relevance
between the input query and its corresponding document, assigning a score from 0 to 3.
A score of 0 indicates irrelevant; 1, marginally relevant; 2, relevant; and 3, highly relevant.
Example:
query: “Kursu mestradu no pós-graduasaun UNTL”
document: “Kursu Desportu UNTL sei realiza graduasaun dahuluk tinan ne’e”
reason: “The query is about postgraduate and master’s courses at UNTL, whereas the
document focuses on a sports course. Despite both courses in the query and document
being offered at UNTL, the sports course in the document is not specifically designed for
postgraduate or master’s levels. Thus, the document is only marginally relevant.”
score: 1</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>The query and document to be evaluated are the following:</title>
          <p>query: { }
document: { }
Your response must be in JSON format, where the first field is “reason”,
explaining your reasoning, and the second field is “score”.</p>
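          <p>Since the prompt requests a JSON object with a “reason” field and a “score” field, each model response can be parsed and validated along the following lines; this is a minimal sketch, as the paper does not show its exact parsing code.</p>
          <preformat>
# Minimal sketch for parsing one model response into (reason, score).
# The prompt constrains "score" to the graded range 0-3.
import json

def parse_judgment(raw_response: str) -> tuple[str, int]:
    data = json.loads(raw_response)
    reason, score = data["reason"], int(data["score"])
    if score not in (0, 1, 2, 3):
        raise ValueError(f"score out of range: {score}")
    return reason, score

reason, score = parse_judgment('{"reason": "Query and document match.", "score": 3}')
print(score)  # 3
          </preformat>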
          <p>
            We initiated the experiment with the LLaMA3 70B model, as it was our primary target for
comparing agreement levels with human annotators. We tested this model using
temperatures of 0.0 and 0.5. The idea of comparing different model temperatures
in terms of inter-annotator agreement was inspired by the work of Ma et al. [
            <xref ref-type="bibr" rid="ref11">22</xref>
            ], who applied LLMs
for relevance judgments in Chinese legal case retrieval. When we increased the temperature
of the LLaMA3 70B model, the results were not satisfactory. Therefore, we opted for a zero
temperature setting in the other, paid models for comparison.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results and Discussions</title>
      <p>
        In the experiment with the LLaMA3 70B model set at zero temperature, we obtained an
inter-annotator agreement with human annotators of 0.2634 in terms of Cohen’s kappa. After increasing
the temperature to 0.5, the inter-annotator agreement slightly decreased to 0.2594 (a reduction
of 0.004). This finding aligns with the research by Ma et al. [
        <xref ref-type="bibr" rid="ref11">22</xref>
        ], whose Cohen’s kappa
scores for inter-annotator agreement between humans and LLMs also marginally decreased
when they raised the temperature from 0.4 to 0.7 in evaluations of material facts.
      </p>
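      <p>The reported figure corresponds to the two-rater Cohen’s kappa between the human labels and the LLM-assigned labels over the judged pairs; below is a minimal sketch with placeholder labels, again assuming scikit-learn.</p>
      <preformat>
# Minimal sketch: Cohen's kappa between human and LLM relevance labels.
# The arrays are placeholders standing in for the 6,100 judged pairs.
from sklearn.metrics import cohen_kappa_score

human_labels = [0, 1, 3, 2, 1, 0, 2, 3]
llm_labels = [0, 1, 2, 2, 1, 1, 2, 3]

print(f"Cohen's kappa: {cohen_kappa_score(human_labels, llm_labels):.4f}")
      </preformat>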
      <p>Consequently, we opted for a zero temperature setting when conducting relevance judgments
with the Claude 3 Haiku and GPT-3.5 Turbo models. Comparisons of the inter-annotator
agreement levels between LLMs and human annotators are presented in Table 9. These results
show that the LLaMA3 70B model achieved the highest Cohen’s kappa score, indicating the
most substantial agreement with human annotators compared to both paid models. Among
the paid models, GPT-3.5 Turbo exhibited a slightly higher Cohen’s kappa score than Claude 3
Haiku (an increase of 0.0012). Thus, despite the superior performance of the paid
models in translating Tetun into English, this finding suggests that a deeper level of language
understanding is more crucial in automated relevance judgment tasks.</p>
      <p>
        As a result, our findings using the LLaMA3 70B model closely align with the initial results
reported by Faggioli et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and are consistent with the findings of Bueno et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Thomas
et al. [13]. Comparisons of these findings regarding the use of LLMs to automate relevance
judgments are presented in Table 10.
      </p>
      <p>Furthermore, our experiments took an average of approximately 3.56 hours to complete the
relevance judgment tasks for each model. The costs associated with the two paid models are
detailed in Table 11. Given that GPT-3.5 Turbo is priced $0.25 higher than Claude 3
Haiku per 1 million input and output tokens, the expenses for GPT-3.5 Turbo were higher than
those for Claude 3 Haiku.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusions and Future Work</title>
      <p>Our exploration into leveraging large language models for automating relevance judgment tasks
in low-resource language scenarios, demonstrated using Tetun, has yielded results comparable
to those achieved in high-resource languages, thus encouraging further research in low-resource
languages (LRLs). The availability of freely and openly accessible models like LLaMA3 opens
up possibilities for advancing relevance judgment tasks, particularly in low-resource language
contexts, even with the limited digital content available on the web.</p>
      <p>
        Our experiment demonstrated that despite LLaMA3’s knowledge being limited to December
2023 (see the LLaMA3 model card: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md) and the availability of fewer than 45k Tetun documents on the web by that time [
        <xref ref-type="bibr" rid="ref12">23, 16</xref>
        ], it
achieved an agreement level comparable to that reported for high-resource languages such as English. This indicates
that automated relevance judgment tasks are also feasible for other LRLs.
      </p>
      <p>In future work, we plan to extend this research by incorporating a wider variety of examples
in our prompts and testing with other freely and openly available models to compare the results.
This approach will help validate and potentially expand the use of large language models in
relevance judgment tasks.</p>
    </sec>
    <sec id="sec-8">
      <title>7. Acknowledgment</title>
      <p>This work is financed by National Funds through the Portuguese funding agency,
FCT - Fundação para a Ciência e a Tecnologia, within project LA/P/0063/2020 (DOI
10.54499/LA/P/0063/2020) and the Ph.D. scholarship grant number SFRH/BD/151437/2021 (DOI
10.54499/SFRH/BD/151437/2021).</p>
    </sec>
    <sec id="sec-9">
      <title>A. System Prompt Details</title>
      <p>Details of the system prompt used in the automated relevance judgments, including four
examples of query-document pairs along with the reasoning and the corresponding score for
each.</p>
      <p>You are an expert assessor and you are tasked with assessing the relevance
between the input query and its corresponding document, assigning a score from 0 to 3.
A score of 0 indicates irrelevant; 1, marginally relevant; 2, relevant; and 3, highly relevant.
Example 1:
query: “Programa mestradu no pós-graduasaun UNTL”
document: “Estudantes Pós-Graduasaun IOB Kuda Ai-Oan iha aldeia Payol no Bedois”
reason: “The query is about postgraduate and master’s courses at UNTL, whereas the
document discusses the activities of postgraduate students from IOB. Although both
query and document contain the term ’postgraduate’, the query specifically targets
courses at UNTL. Therefore, they are irrelevant.”
score: 0.</p>
      <p>Example 2:
query: “Kursu mestradu no pós-graduasaun UNTL”
document: “Kursu Desportu UNTL sei realiza graduasaun dahuluk tinan ne’e”
reason: “The query is about postgraduate and master’s courses at UNTL, whereas the
document focuses on a sports course. Despite both courses in the query and document
being offered at UNTL, the sports course in the document is not specifically designed for
postgraduate or master’s levels. Thus, the document is only marginally relevant.”
score: 1.</p>
      <p>Example 3:
query: “Kursu mestradu no pós-graduasaun UNTL”
document: “UNTL Nia Vise Reitór Asuntu Pós-Graduasaun No Peskiza Hakotu-iis”
reason: “The document is relevant as it details the vice-director of the postgraduate
program at UNTL. However, its relevance is somewhat diminished as it primarily
discusses the unfortunate passing of the vice-director rather than the progress or
implementation of the program. Hence, they are relevant.”
score: 2.</p>
      <sec id="sec-9-1">
        <title>Example 4:</title>
        <p>query: “Kursu mestradu no pós-graduasaun UNTL”
document: “UNTL Lansa Kursu Pós-Graduasaun No Mestradu Iha Área Lima”
reason: “Both the query and document address postgraduate and master’s courses at
UNTL. The document strongly correlates with the query, covering the launch of
postgraduate and master’s courses at UNTL. Therefore, they are highly relevant.”
score: 3.</p>
      </sec>
      <sec id="sec-9-2">
        <title>The query and document to be evaluated are the following:</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cleverdon</surname>
          </string-name>
          ,
          <article-title>The cranfield tests on index language devices, in: Aslib proceedings</article-title>
          , volume
          <volume>19</volume>
          ,
          <string-name>
            <surname>MCB</surname>
            <given-names>UP</given-names>
          </string-name>
          <string-name>
            <surname>Ltd</surname>
          </string-name>
          ,
          <year>1967</year>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Harman</surname>
          </string-name>
          (Ed.),
          <source>Proceedings of The First Text REtrieval Conference</source>
          , TREC 1992, Gaithersburg, Maryland, USA, November 4-
          <issue>6</issue>
          ,
          <year>1992</year>
          , volume
          <volume>500</volume>
          -207 of NIST Special Publication,
          <source>National Institute of Standards and Technology (NIST)</source>
          ,
          <year>1992</year>
          . URL: http: //trec.nist.gov/pubs/trec1/t1_proceedings.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dietz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          , G. Demartini,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          , E. Kanoulas,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <article-title>Perspectives on large language models for relevance judgment</article-title>
          , in: M.
          <string-name>
            <surname>Yoshioka</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kiseleva</surname>
          </string-name>
          , M. Aliannejadi (Eds.),
          <source>Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval</source>
          ,
          <string-name>
            <surname>ICTIR</surname>
          </string-name>
          <year>2023</year>
          , Taipei, Taiwan, 23
          <source>July</source>
          <year>2023</year>
          , ACM,
          <year>2023</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>50</lpage>
          . URL: https://doi.org/10.1145/3578337.3605136. doi:
          <volume>10</volume>
          .1145/3578337.3605136.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bueno</surname>
          </string-name>
          , E. S. de Oliveira,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Lotufo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <article-title>Quati: A brazilian portuguese information retrieval dataset from native speakers</article-title>
          ,
          <year>2024</year>
          . arXiv:
          <volume>2404</volume>
          .
          <fpage>06976</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>G. de Jesus</surname>
          </string-name>
          ,
          <article-title>Text information retrieval in Tetun</article-title>
          , in: J.
          <string-name>
            <surname>Kamps</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Crestani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Maistro</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Joho</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <string-name>
            <surname>Kruschwitz</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Caputo (Eds.), Advances in Inforand ICCL,
          <string-name>
            <surname>Torino</surname>
          </string-name>
          , Italia,
          <year>2024</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>188</lpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          .sigul-
          <volume>1</volume>
          .
          <fpage>22</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>E.</given-names>
            <surname>Sormunen</surname>
          </string-name>
          ,
          <article-title>Liberal relevance criteria of TREC -: counting on negligible documents?</article-title>
          , in: K. Järvelin,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beaulieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          Myaeng (Eds.),
          <source>SIGIR 2002: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 11-15</source>
          ,
          <year>2002</year>
          , Tampere, Finland,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          ,
          <year>2002</year>
          , pp.
          <fpage>324</fpage>
          -
          <lpage>330</lpage>
          . URL: https://doi.org/10.1145/564376.564433. doi:
          <volume>10</volume>
          .1145/564376.564433.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Meta</surname>
          </string-name>
          ,
          <article-title>Introducing meta llama 3: The most capable openly available llm to date, 2024</article-title>
          . URL: https://llama.meta.com/llama3/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          , in: P.
          <string-name>
            <surname>Isabelle</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Charniak</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
          </string-name>
          (Eds.),
          <article-title>Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Philadelphia, Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL: https://aclanthology.org/P02-1040. doi:
          <volume>10</volume>
          .3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazeika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <article-title>Measuring massive multitask language understanding</article-title>
          ,
          <source>in: 9th International Conference on Learning Representations, ICLR</source>
          <year>2021</year>
          ,
          <string-name>
            <given-names>Virtual</given-names>
            <surname>Event</surname>
          </string-name>
          , Austria, May 3-
          <issue>7</issue>
          ,
          <year>2021</year>
          , OpenReview.net,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=d7KBjmI3GmQ.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [21]
          <string-name>
            <surname>P.</surname>
          </string-name>
          <article-title>with Code, Multi-task language understanding on mmlu</article-title>
          ,
          <year>2024</year>
          . URL: https:// paperswithcode.com
          <article-title>/sota/multi-task-language-understanding-on-mmlu.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <article-title>Leveraging large language models for relevance judgments in legal case retrieval</article-title>
          ,
          <source>CoRR abs/2403</source>
          .18405 (
          <year>2024</year>
          ). URL: https://doi.org/10.48550/arXiv. 2403.18405. doi:
          <volume>10</volume>
          .48550/ARXIV.2403.18405. arXiv:
          <volume>2403</volume>
          .
          <fpage>18405</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kudugunta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Caswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Choquette-Choo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kusupati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bapna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Firat</surname>
          </string-name>
          , MADLAD-400:
          <article-title>A multilingual and documentlevel large audited dataset</article-title>
          ,
          <source>CoRR abs/2309</source>
          .04662 (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/ arXiv.2309.04662. doi:
          <volume>10</volume>
          .48550/ARXIV.2309.04662. arXiv:
          <volume>2309</volume>
          .
          <fpage>04662</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>