<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Birger Larsen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman Jurowetzki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aalborg University Business School</institution>
          ,
          <addr-line>Fibigerstraede 11, 9220 Aalborg East</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Aalborg University, Dept. of Communication and Psychology</institution>
          ,
          <addr-line>A. C. Meyers Vaenge 15, 2450 Copenhagen C</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <fpage>4</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>While some success has been achieved in the classification of citation contexts using Large Language Models (LLMs) [1], significant improvements are still needed to achieve sufficient accuracy for large-scale, real-world application. We propose to use recent LLMs with AI reasoning capabilities (such as QwQ, DeepSeek R1 and Gemini Thinking) to study whether reasoning processes have the potential to achieve such performance improvements. We outline here how we plan to test this on two tasks: 1) classification of citation contexts, and 2) elimination of low-quality annotations from training sets. Full results will be presented at the 1st Scolia workshop. We also describe our experiences in creating a small annotated citation context dataset, and present some initial results.</p>
      </abstract>
      <kwd-group>
        <kwd>Citation Context Classification</kwd>
        <kwd>Reasoning Models</kwd>
        <kwd>Data Annotation Cleaning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>A citation context can be defined as “that particular passage or statement within the citing document
containing the references” [2, p. 288]. Where most work in bibliometrics and citation analysis has, in
one way or another, focused on the number of citations received, the analysis of citation contexts is
interesting because it may provide insights into why a given paper has been cited. With almost all
scientific text being produced, published and archived in electronic formats, and with an increasing share
of it Open Access, citation context analysis may greatly enhance our understanding of scientific
discourse, knowledge creation and claims, and the wider impact of science. If citation contexts can be
automatically extracted, processed and classified, they have great potential for improving access to and
exploitation of scientific knowledge from scientific documents, with a broad set of potential applications,
e.g. in research analysis, science of science studies, patent and innovation analysis, information retrieval
and information visualization.</p>
      <p>
        Where early studies of citer motivations and referencing behavior were largely carried out manually and
qualitatively (see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for an overview), the advent of machine learning, Large Language Models (LLMs)
and generative AI, together with the availability of a sizable proportion of scientific publications in electronic
formats, makes it possible to investigate whether citation contexts can be automatically extracted and classified,
and with what accuracy.
      </p>
      <p>
        Even with scientific publications in electronic formats, extracting citation contexts can be challenging.
Publications marked up in structured formats like XML make identification and extraction of
citation contexts relatively straightforward. For instance, citation contexts can be easily identified
and extracted from PubMedCentral fulltext articles because of XML tags marking in-text references
(and linking them to the bibliography) [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Citation contexts can also be extracted from LaTeX
documents, albeit with somewhat more difficulty and less accuracy [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Extraction from widely used but
unstructured formats like PDF is a bigger challenge, but is possible with a combination of several tools
and pipelines. For instance, the team at CORE is now able to reliably extract citation contexts as well as
salient document features from large corpora of PDFs from diverse sources [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Citation context classification using machine learning has mainly relied on supervised methods,
using large annotated datasets of citation contexts for fine-tuning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The performance of such models
depends mainly on the size of the training set, which has limited research in the area because annotations
are cumbersome to generate (ideally, annotations should be done by scientists active in the research fields
in question). Recently, work has been done using generative models for citation context classification,
either by direct zero-shot prompting [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], by combinations of prompting and LM fine-tuning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or by
zero- to many-shot prompting [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Results are promising, but not consistently good enough to, e.g., reach
the level of human annotators.
      </p>
      <p>Our long-term goal is to investigate solutions that bring citation context classification to a level of
performance that would allow it to be used with confidence in real-world scenarios. This will
likely involve a combination of creating annotated training sets and fine-tuning existing models
on the one hand, and, on the other, investigating where prompting can help improve and extend
training sets as well as the classification itself, combining these approaches incrementally over
several rounds to increase performance and scale up research in citation context classification
to process large, diverse corpora of scientific publications reliably.</p>
      <p>In this paper, we propose to investigate whether some of the most recently released LLMs with integrated
reasoning capabilities can significantly improve citation context classification performance as well as
the assessment of annotation quality. Reasoning models are attractive because they may give us deeper
insight into how the models arrive at their decisions and thus enable us to improve them.</p>
      <p>Our contributions are:
• We describe our experiences in creating a small annotated citation context dataset, and present
some first results of applying it.
• We propose to use reasoning models for citation context classification, and outline several lines
of research that might benefit from their application.
• We present some first results of applying reasoning models in assessing annotation quality.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>
        Our approach is situated within a broader trend in computational social science where LLMs are
increasingly explored as tools for qualitative and quantitative text analysis. LLMs offer the potential to
automate and scale complex coding tasks, reduce the labor-intensive nature of manual annotation, and
potentially improve consistency and objectivity in social science research [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. For instance, LLMs have
shown promise in assisting deductive coding [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], supporting thematic analysis [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and providing
accurate annotations for complex texts, even outperforming human experts in certain scenarios [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>We leverage recent advancements in LLMs with integrated reasoning capabilities to address the
challenges in citation context classification and annotation quality assessment. We employ a zero-shot
classification methodology, directly prompting reasoning-enabled LLMs such as QwQ and DeepSeek R1
to classify citation contexts based on predefined categories. The selection of models like DeepSeek R1
is motivated by their demonstrated expert-level reasoning capabilities on complex tasks as well as the
fact that they expose their reasoning traces, unlike OpenAI’s o1 and o3 [14, 15].</p>
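As a concrete illustration of this zero-shot setup, the following Python sketch assembles a classification prompt and parses the model’s JSON reply. The prompt wording, the helper names and the fallback label for the tenth class are our own assumptions, not the exact prompt used in the experiments.

```python
import json

# Hypothetical sketch: prompt wording and the fallback "None of the above"
# label are our assumptions, not the study's exact prompt.
CATEGORIES = [
    "Background/Perfunctory", "Contemporary", "Contrast/Conflict",
    "Evaluation", "Explanation", "Method", "Data", "Modality",
    "Similarity/Consistency", "None of the above",
]

def build_prompt(context: str, section_title: str, ref_id: str) -> str:
    """Assemble a zero-shot classification prompt for a reasoning LLM."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(CATEGORIES))
    return (
        "Classify the citation marked by the reference ID in the context "
        "below into exactly one of these categories:\n"
        f"{numbered}\n\n"
        f"Section title: {section_title}\n"
        f"Reference ID: {ref_id}\n"
        f"Citation context: {context}\n\n"
        'Reply as JSON: {"category": 1-10, "reasoning": "..."}'
    )

def parse_answer(raw: str):
    """Pull the class number and reasoning trace out of the model reply,
    tolerating extra text around the JSON object."""
    start, end = raw.find("{"), raw.rfind("}") + 1
    obj = json.loads(raw[start:end])
    return int(obj["category"]), obj["reasoning"]
```

Requesting structured JSON plus a free-text reasoning field keeps the classification machine-readable while preserving the trace for the qualitative analysis described below.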
      <p>A key aspect of our methodology is the analysis of the reasoning traces generated by these models,
inspired by work on self-reflection in LLM agents [16]. These traces, which detail the step-by-step
thought process of the LLM in arriving at a classification decision, offer a unique opportunity for several
valuable insights. Firstly, by examining these traces, we can gain a deeper understanding of how the LLM
interprets and applies our classification guidelines to the citation contexts. This allows us to iteratively
refine and improve these guidelines, making them more precise and less ambiguous for both automated
and human annotators. Secondly, the reasoning traces provide a basis for comparison with the cognitive
processes of human annotators who previously worked on creating the training dataset. By contrasting
the LLM’s reasoning with human annotation rationales, we can identify potential discrepancies, biases,
or areas of subjective interpretation inherent in the classification task.</p>
      <p>The classification scheme comprises ten citation context types:
• Background/Perfunctory: checks whether the cited article is merely a part of the relevant literature and
is not analyzed or compared to other literature.
• Contemporary: checks whether the given citation is explicitly characterized as “recent” by the author.
• Contrast/Conflict: checks whether results or opinions in a given citation show contrast to or
conflict with an opinion or result presented by the author of the citing paper.
• Evaluation: checks if the results in the cited study are evaluated in the citing study.
• Explanation: checks whether the cited work helps explain the results or hypotheses in the
current study.
• Method: checks whether the given citation is a methodology that was followed in the
citing work (with or without modifications).
• Data: checks whether the given citation points to a data set from another source
that has been used (fully or in part) in the citing article.
• Modality: checks whether the citing author expresses lack of certainty over a result
or opinion in the cited study.
• Similarity/Consistency: checks whether results or opinions in the given citation are similar to or
consistent with the given study or another cited work.
• None of the above: assigned if none of the other types apply.</p>
      <p>To quantitatively assess the agreement between the LLM classifications and the existing human
annotations, we calculate Inter-Rater Reliability (IRR) scores. Specifically, we plan to use Gwet’s AC1, a
robust IRR metric particularly suitable for situations with varying numbers of raters or when dealing
with class imbalance, which is often encountered in citation context classification tasks.</p>
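For the two-rater case (human vs. LLM), Gwet’s AC1 can be computed directly from its definition: observed agreement corrected by a chance-agreement term built from mean marginal proportions. A minimal Python sketch (our own illustrative implementation, not the study’s tooling):

```python
from collections import Counter

def gwet_ac1(rater_a, rater_b):
    """Gwet's AC1 agreement coefficient for two raters.

    AC1 = (pa - pe) / (1 - pe), where pa is observed agreement and pe is
    chance agreement built from mean marginal category proportions.
    Assumes at least two distinct categories across both raters.
    """
    n = len(rater_a)
    assert n == len(rater_b) and n > 0
    cats = sorted(set(rater_a) | set(rater_b))
    # Observed agreement: share of items both raters labelled identically.
    pa = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    pe = 0.0
    for c in cats:
        pi = (count_a[c] + count_b[c]) / (2 * n)  # mean marginal proportion
        pe += pi * (1 - pi)
    pe /= len(cats) - 1
    return (pa - pe) / (1 - pe)
```

Unlike Cohen’s kappa, the chance term stays small when one category dominates, which is why AC1 behaves better on heavily imbalanced citation context data.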
      <p>In the context of citation context classification, reasoning-enabled LLMs could significantly enhance
our ability to process large volumes of scholarly literature, providing valuable insights into the dynamics
of scientific knowledge production and dissemination. By focusing on the reasoning process and
comparing it to human annotation, we aim to not only improve the accuracy of automated citation
context classification but also to gain a deeper understanding of the nuances and challenges inherent in
this task, ultimately contributing to more robust and reliable methods for scholarly information access.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>citing articles (sampling every nth reference from the reference list to make a total of 15 citation contexts
from each citing document). The advantage of this was to sample references across the document,
increasing the chance of annotating citation contexts from across each document while minimizing
annotation costs.</p>
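The every-nth sampling described above can be sketched as follows; the function name and the even-spacing choice are our own illustrative assumptions.

```python
def sample_references(refs, k=15):
    """Evenly sample up to k references across a reference list,
    mirroring the every-nth strategy described above (illustrative
    sketch, not the authors' actual script)."""
    if k >= len(refs):
        return list(refs)
    step = len(refs) / k  # spacing between sampled positions
    return [refs[int(i * step)] for i in range(k)]
```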
      <p>We hired two of our MSc students for annotation, with a fixed number of hours to complete as many
annotations as possible from this sample. As we did not have access to biomedical students, both
students were in the field of Information Technology. Annotators worked with the XML fulltext of
the citing articles, and extracted: 1) the XPath to the sentence containing the citation marker, 2) the
text of this sentence including any XML markup (= the citation context), 3) the internal reference ID, 4) the
XPath to the lowest (sub)section containing the citation context, and 5) the title of this section (e.g. Introduction,
Discussion, Data Analysis, Statistical Analysis, Strengths and limitations of the data, etc.). In addition,
annotators assessed: 6) the citation context type (using the scheme discussed above), 7) their own confidence in
their assessment (High/Medium/Low), and 8) whether in their view additional context was needed for
interpretation, i.e. whether one, two or more sentences before/after were needed.</p>
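The extraction part of this workflow can be sketched with the standard library’s XML tooling; the toy JATS fragment and function name below are our own assumptions, and the XPath bookkeeping the annotators recorded is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Toy JATS-style fragment standing in for a PubMedCentral fulltext article.
SAMPLE = """<article><body><sec><title>Introduction</title>
<p>Prior work <xref ref-type="bibr" rid="ref3">3</xref> studied this.</p>
<p>No citation here.</p></sec></body></article>"""

def extract_contexts(xml_text):
    """Collect (section title, reference ID, context text) triples for
    every in-text reference marker. A simplification of the annotators'
    workflow: here the whole paragraph stands in for the sentence."""
    root = ET.fromstring(xml_text)
    rows = []
    for sec in root.iter("sec"):
        title_el = sec.find("title")
        title = title_el.text if title_el is not None else ""
        for p in sec.iter("p"):
            for xref in p.findall("xref[@ref-type='bibr']"):
                context = "".join(p.itertext()).strip()
                rows.append((title, xref.get("rid"), context))
    return rows
```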
      <p>A total of 585 annotations were collected. Table 1 shows the distribution of citation context types
as well as the annotator confidence. As expected from previous research, the Background/Perfunctory
class has the highest share (50%), followed by Explanation (22%), Similarity/Consistency (7%) and
Contrast/Conflict (6%). Other classes were 5% or less, and no contexts were assessed as Contemporary.
Overall, assessor confidence in their own judgments was High (87%), with 12% Medium and only 1%
Low. In order to highlight the distribution of High/Medium/Low confidence levels, the percentages are tallied for
each context type. It can be seen that some of the less used context types have a high share of Medium
confidence, e.g. 11 out of 20 Method instances (55%) had Medium confidence; similar levels can be
seen for Modality (50%) and Evaluation (46%). Table 3 shows that in 80% of cases no additional context
was needed beyond the sentence with the citation marker, that in 11% one sentence before was
needed, and that two sentences before were needed in 4%. In only 2% of cases was one sentence after needed.</p>
      <p>The data allows us to study the distribution of citation context types of the seed document: 37 citation
contexts citing it were annotated: 16 (43%) were Background/Perfunctory, 12 (32%) were Explanation, 4
(11%) were Similarity/Consistency, 3 (8%) were Contrast/Conflict, and 2 (5%) were Evaluation.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Preliminary Experiments and Results</title>
      <p>To evaluate the feasibility of employing reasoning models for citation context classification, we
conducted initial experiments utilizing the DeepSeek R1 (DeepSeek-R1-Distill-Qwen-32B) Large Language
Model. We used the 585 citation contexts from our manually annotated dataset, detailed in Section 3. In
a zero-shot setting, each citation context, along with its section title and reference ID, was presented to
DeepSeek R1 with prompts designed to classify it into one of ten predefined categories (Section 2). The
model was instructed to output its classification in JSON format, accompanied by a reasoning trace
explaining its decision-making process.</p>
      <p>Following classification of all 585 contexts, we quantitatively assessed DeepSeek R1’s performance by
comparing its classifications to the human annotations in our dataset. We calculated standard evaluation
metrics, including Gwet’s AC1 inter-rater reliability, accuracy, precision, recall, and F1-score. These
metrics provide a comprehensive evaluation of the model’s agreement with human annotators and its
overall classification effectiveness.</p>
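The accuracy and per-class metrics can be reproduced from their standard definitions with a short pure-Python routine (a sketch; the study does not specify its actual evaluation tooling):

```python
from collections import Counter

def classification_report(y_true, y_pred):
    """Accuracy, per-class precision/recall/F1, and weighted-average F1,
    computed from standard definitions (illustrative sketch)."""
    n = len(y_true)
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class, weighted_f1 = {}, 0.0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class[c] = (prec, rec, f1)
        weighted_f1 += f1 * support[c] / n  # weight by true-class share
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return accuracy, per_class, weighted_f1
```

Weighting F1 by the true-class share is what lets the dominant Background/Perfunctory class pull the weighted average upward, as seen in the results.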
      <p>Analysis of the initial classification results revealed an overall accuracy of 0.508. Gwet’s AC1
inter-rater reliability was calculated at -0.798, indicating poor agreement beyond chance between the LLM
and human annotations across all categories. Examining class distribution, ’Background/Perfunctory’
(Class 1) was the most frequent category in human annotations (50.4%), while the LLM distributed its
classifications more broadly, notably increasing the proportion of ’Data’ (Class 7) and
’Similarity/Consistency’ (Class 9) classifications. The weighted-average F1-score, accounting for class imbalance, was
0.517. Per-class metrics showed that ’Background/Perfunctory’ (Class 1) achieved a precision of 0.732,
recall of 0.583, and an F1-score of 0.649. This higher performance is likely due to the class’s dominant
presence in the dataset. However, performance on other categories, particularly less frequent ones,
was considerably lower, suggesting room for improvement in aligning LLM classifications with human
annotations across the spectrum of citation context categories.</p>
      <p>To further understand the discrepancies between human and LLM classifications, we performed
a preliminary qualitative analysis on a subset of disagreements. Examining a manual sample of 15
instances, we observed a trend suggesting that the reasoning provided by DeepSeek R1 often indicated
classifications that were more contextually accurate than the original human (student) annotations.
The LLM demonstrated a nuanced understanding of citation function, frequently capturing subtle
contextual cues pointing to categories beyond the often-applied ’Background/Perfunctory’ label. The
reasoning traces highlighted the model’s sensitivity to linguistic markers of contrast, explanation, and
consistency, suggesting an ability to discern the rhetorical role of citations in scientific discourse. This
initial qualitative exploration suggests that the human annotations, while serving as the initial ground truth, could
benefit from critical review and revision. The observed discrepancies point to the need to further
refine annotation guidelines for greater consistency and to better capture the nuanced functional roles
of citations. This iterative process of comparing human and AI classifications holds promise for not
only improving automated citation context classification accuracy but also enhancing the quality and
reliability of the training data itself.</p>
      <p>The detailed results of this evaluation, including specific metric values and further analysis of the
model’s reasoning traces, will be presented at the SCOLIA’25 workshop. These preliminary experiments
provide a crucial initial step in understanding the capabilities of reasoning models for citation context
classification and will guide our subsequent research directions.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research was supported by a seed project grant from maSSHine, the Aalborg University Computational
Social Sciences and Humanities initiative.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used the QwQ-32B reasoning model to predict
citation context categories, and for prompt improvements and instruction fixes through a
self-tuning approach, as well as Gemma-3-27B for citation context classification using the augmented
prompts.</p>
      <p>[14] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi,
DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, arXiv preprint
arXiv:2501.12948 (2025).
[15] D. Rein, B. Hou, A. Stickland, J. Petty, R. Pang, J. Dirani, J. Michael, S. Bowman, GPQA: A
graduate-level google-proof Q&amp;A benchmark, arXiv preprint arXiv:2311.12022 (2023).
[16] M. Renze, E. Guven, Self-reflection in LLM agents: Effects on problem-solving performance, arXiv
preprint arXiv:2405.06682v1 (2024).
[17] M. Chadeau-Hyam, B. Bodinier, J. Elliott, M. D. Whitaker, I. Tzoulaki, R. Vermeulen, M. Kelly-Irving,
C. Delpierre, P. Elliott, Risk factors for positive and negative covid-19 tests: a cautious and in-depth
analysis of UK biobank data, International Journal of Epidemiology 49 (2020) 1454–1467. URL:
http://dx.doi.org/10.1093/ije/dyaa134. doi:10.1093/ije/dyaa134.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Kunnath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pride</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Knoth</surname>
          </string-name>
          ,
          <article-title>Prompting strategies for citation classification</article-title>
          ,
          <source>in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management</source>
          , CIKM '23, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , pp.
          <fpage>1127</fpage>
          -
          <lpage>1137</lpage>
          . URL: https://doi.org/10.1145/3583780.3615018. doi:10.1145/3583780.3615018.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Small</surname>
          </string-name>
          ,
          <article-title>Citation context analysis</article-title>
          , in:
          <string-name>
            <given-names>B.</given-names>
            <surname>Dervin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Voigt</surname>
          </string-name>
          (Eds.),
          <source>Progress in Communication Sciences</source>
          , volume
          <volume>3</volume>
          ,
          <year>1982</year>
          , pp.
          <fpage>287</fpage>
          -
          <lpage>310</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Automatically classifying the role of citations in biomedical articles</article-title>
          ,
          <source>in: AMIA Annual Symposium Proceedings</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.-K.</given-names>
            <surname>Hsiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Torvik</surname>
          </string-name>
          ,
          <article-title>OpCitance: Citation contexts identified from the pubmed central open access articles</article-title>
          ,
          <source>Scientific Data</source>
          <volume>10</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Nielsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Skau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Meier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Larsen</surname>
          </string-name>
          ,
          <article-title>Optimal citation context window sizes for biomedical retrieval</article-title>
          ,
          <source>CEUR Workshop Proceedings of the 8th International Workshop on Bibliometric-Enhanced Information Retrieval, BIR 2019 - Cologne, Germany</source>
          <volume>2345</volume>
          (
          <year>2019</year>
          )
          <fpage>51</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dabrowska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Larsen</surname>
          </string-name>
          ,
          <article-title>Exploiting citation contexts for physics retrieval</article-title>
          ,
          <source>CEUR Workshop Proceedings of the Second Workshop on Bibliometric-enhanced Information Retrieval : co-located with the 37th European Conference on Information Retrieval (ECIR</source>
          <year>2015</year>
          )
          <volume>1344</volume>
          (
          <year>2015</year>
          )
          <fpage>14</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nambanoor Kunnath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stauber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pride</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Botev</surname>
          </string-name>
          , P. Knoth,
          <article-title>ACT2: A multi-disciplinary semi-structured dataset for importance and purpose classification of citations</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Béchet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Blache</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Choukri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Cieri</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Declerck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Goggi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Isahara</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Maegaard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mariani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Mazo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Odijk</surname>
          </string-name>
          , S. Piperidis (Eds.),
          <source>Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>3398</fpage>
          -
          <lpage>3406</lpage>
          . URL: https://aclanthology.org/2022.lrec-1.363/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Nishikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Koshiba</surname>
          </string-name>
          ,
          <article-title>Exploring the applicability of large language models to citation context analysis</article-title>
          ,
          <source>Scientometrics</source>
          <volume>129</volume>
          (
          <year>2024</year>
          )
          <fpage>6751</fpage>
          -
          <lpage>6777</lpage>
          . URL: https://ideas.repec.org/a/spr/scient/v129y2024i11d10.1007_s11192-024-05142-9.html. doi:10.1007/s11192-024-05142-9.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Koloveas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chatzopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vergoulis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tryfonopoulos</surname>
          </string-name>
          ,
          <article-title>Can LLMs predict citation intent? An experimental analysis of in-context learning and fine-tuning on open LLMs</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.14561. arXiv:2502.14561.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ziems</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Held</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Shaikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Can large language models transform computational social science?</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>50</volume>
          (
          <year>2024</year>
          )
          <fpage>237</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Chew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bollenbacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wenger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Speer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>LLM-assisted content analysis: Using large language models to support deductive coding</article-title>
          ,
          <source>arXiv preprint arXiv:2311.18716</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Xiong</surname>
          </string-name>
          , L.-W. Ku,
          <article-title>LLM-in-the-loop: Leveraging large language model for thematic analysis</article-title>
          ,
          <source>arXiv preprint arXiv:2310.15100</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Törnberg</surname>
          </string-name>
          ,
          <article-title>Chatgpt-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning</article-title>
          ,
          <source>arXiv preprint arXiv:2304.06588</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>