<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Measuring What Matters: Probing Transit Reasoning Consistency in Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hariram Veeramani</string-name>
          <email>hariram@ucla.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Surendrabikram Thapa</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Usman Naseem</string-name>
          <email>usman.naseem@mq.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing, Macquarie University</institution>
          ,
          <addr-line>Sydney, NSW, 2113</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California Los Angeles</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Virginia Tech</institution>
          ,
          <addr-line>Blacksburg, Virginia, 24060</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a micro-benchmark and a comprehensive evaluation framework for transit-domain Large Language Models that transcend traditional accuracy metrics by probing in-context learning capabilities and multi-step reasoning processes. Our approach introduces four complementary evaluation paradigms, namely Perturbation Chains, Narrative Coherence Checks, Minimal Edit Plausibility, and Cross-Modal Anchoring, that collectively assess how models adapt, reason, and maintain consistency under domain-specific constraints. Through systematic evaluation of four state-of-the-art models, we demonstrate substantial performance disparities in cascading reasoning scenarios despite similar baseline accuracy, revealing fundamental limitations in current evaluation methodologies. Our framework, together with the benchmark, provides actionable insights for post-training optimization strategies, enables principled comparison of retrieval-augmented versus tool-calling architectures, and establishes theoretical foundations for deploying specialized smaller models in safety-critical transit applications. The benchmark and evaluation suite will be shared with the community along with further extended studies.</p>
      </abstract>
      <kwd-group>
        <kwd>Composite Reasoning</kwd>
        <kwd>Multi-step Reasoning</kwd>
        <kwd>KG Reasoning</kwd>
        <kwd>Agentic Systems</kwd>
        <kwd>LLM Consistency</kwd>
        <kwd>LLM Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models have demonstrated remarkable capabilities across diverse reasoning tasks [
        <xref ref-type="bibr" rid="ref1">1, 2</xref>
        ],
from mathematical problem-solving [3, 4] and code generation [5, 6] to commonsense reasoning [7, 8, 9]
and logical inference [10, 11]. This success has motivated their deployment in increasingly complex
real-world applications, including safety-critical domains such as public transit systems. Recent studies
report that LLMs achieve accuracy rates exceeding 90% on General Transit Feed Specification (GTFS)
tasks [12, 13], suggesting readiness for production deployment. However, these metrics fundamentally
measure task completion rather than the underlying reasoning capabilities essential for real-world
reliability. When passengers pose complex queries such as "Given current service disruptions, what
alternative routes minimize both travel time and transfers while avoiding construction zones?", the
system must demonstrate sophisticated in-context learning, multi-step reasoning, and adaptive
problem-solving capabilities that traditional accuracy metrics cannot capture.
      </p>
      <p>This discrepancy between measured performance and required reasoning capabilities represents
a critical gap in current evaluation methodologies. Transit systems operate under strict safety and
reliability constraints where reasoning failures can cascade into significant user impact. A system that
achieves high accuracy on isolated queries but fails to maintain logical consistency under perturbations
poses substantial deployment risks.</p>
      <p>Our work addresses this evaluation gap through three contributions built around four evaluation
frameworks. First, we formalize mathematical frameworks that probe distinct dimensions of reasoning
quality in transit-domain applications. Second, we demonstrate how these frameworks reveal fundamental
differences in in-context learning capabilities across model architectures. Third, we propose qualitative
connections between evaluation outcomes and post-training optimization strategies, including supervised
fine-tuning and reinforcement learning, with a focus on relatively smaller language models in
domain-specific evaluation contexts, drawing on recent advances in agentic AI systems [14].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Multi-Dimensional Transit Reasoning Framework</title>
      <p>Let D = (S, R, T) represent a GTFS dataset, where S denotes stops, R represents routes, and T
encompasses scheduled trips. Traditional evaluation computes binary accuracy as Acc(M, Q) =
|Q|^{-1} ∑_{i=1}^{|Q|} 1[M(q_i) = a_i] for model M, query set Q, and ground-truth responses {a_i}.
While computationally efficient, this formulation provides no insight into reasoning processes, failure
propagation mechanisms, or in-context adaptation capabilities.</p>
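      <p>As a reference point, the baseline accuracy computation above can be sketched as follows (the toy model and queries are hypothetical illustrations, not drawn from our benchmark):</p>

```python
def binary_accuracy(model, queries, ground_truth):
    """Traditional evaluation: fraction of queries whose answer exactly
    matches the ground truth, i.e. |Q|^{-1} times the count of 1[M(q_i) = a_i]."""
    correct = sum(1 for q, a in zip(queries, ground_truth) if model(q) == a)
    return correct / len(queries)

# Hypothetical stand-in for a transit QA model.
toy_model = {"Which route serves Stop A?": "Route 5",
             "When is the last trip on Route 5?": "23:40"}.get
queries = ["Which route serves Stop A?", "When is the last trip on Route 5?"]
answers = ["Route 5", "23:10"]
print(binary_accuracy(toy_model, queries, answers))  # 0.5
```

The score treats a near-miss and a catastrophic failure identically, which is precisely the insensitivity the frameworks below address.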
      <p>We propose a comprehensive evaluation framework Φ = {PC, NCC, MEP, CMA} designed to
probe fundamental reasoning dimensions that emerge in transit-domain applications.</p>
      <p>Perturbation Chain Analysis. The Perturbation Chain framework (PC) probes in-context learning
robustness through systematic cascade testing. For base query q_0 and perturbation sequence
{δ_i}_{i=1}^{k}, we construct modified queries q_i = δ_i(q_{i-1}) that incrementally alter system state.
The reasoning consistency score quantifies degradation patterns:</p>
      <p>RCS(M, q_0) = ∏_{i=1}^{k} P[valid(M(q_i)) | valid(M(q_{i-1}))]   (1)</p>
      <p>where valid(·) indicates logical consistency with the perturbed GTFS state. This formulation captures
how effectively models maintain coherent reasoning as problem complexity increases, directly probing
in-context adaptation mechanisms.</p>
      <p>We hypothesize that reasoning degradation follows exponential decay, RCS(M, q_0) ≈ α · β^k,
where parameter α characterizes initial reasoning quality and β &lt; 1 quantifies robustness to cascading
complexity. Models with superior in-context learning should exhibit higher β values, indicating better
preservation of logical consistency under sequential perturbations.</p>
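      <p>Under these definitions, RCS can be estimated per chain depth from a collection of perturbation chains, and the decay parameters α, β recovered with a log-linear fit; a minimal sketch (the estimator design is our own illustration):</p>

```python
import math

def rcs_curve(chains):
    """Estimate RCS(k) = prod_{i<=k} P[valid_i | valid_{i-1}] from many
    perturbation chains, each a list of 0/1 validity flags per step."""
    depth = len(chains[0])
    probs = []
    for i in range(depth):
        if i == 0:
            cond = [c[0] for c in chains]          # unconditional first step
        else:
            cond = [c[i] for c in chains if c[i - 1]]  # condition on prior validity
        probs.append(sum(cond) / max(len(cond), 1))
    curve, acc = [], 1.0
    for p in probs:               # cumulative product gives RCS at each depth
        acc *= p
        curve.append(acc)
    return curve

def fit_decay(curve):
    """Fit RCS(k) ~ alpha * beta**k by least squares in log space."""
    ks = [k + 1 for k in range(len(curve))]
    logs = [math.log(max(r, 1e-9)) for r in curve]
    n = len(ks)
    kbar, lbar = sum(ks) / n, sum(logs) / n
    slope = (sum((k - kbar) * (l - lbar) for k, l in zip(ks, logs))
             / sum((k - kbar) ** 2 for k in ks))
    return math.exp(lbar - slope * kbar), math.exp(slope)  # alpha, beta
```

For a perfectly geometric curve such as [0.8, 0.64, 0.512], the fit recovers α ≈ 1.0 and β ≈ 0.8.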
      <p>Narrative Coherence Assessment. The Narrative Coherence Check framework (NCC) evaluates
temporal-spatial reasoning through natural language journey analysis. Given a narrative N containing
transit descriptions, we extract temporal constraints T(N) and spatial assertions S(N), then verify
feasibility:</p>
      <p>NCC(N, D) = 1[ ⋀_{(t,s) ∈ T(N) × S(N)} feasible(t, s, D) ]   (2)</p>
      <p>This framework probes how models integrate multiple information streams and detect logical
inconsistencies in complex scenarios, providing insights into the compositional reasoning capabilities
essential for transit assistance.</p>
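      <p>The feasibility conjunction of Equation 2 can be sketched directly; the constraint encoding and the toy feasibility check below are hypothetical illustrations, not the extractors used in our benchmark:</p>

```python
def ncc(temporal_constraints, spatial_assertions, feasible):
    """NCC = 1 iff every (temporal constraint, spatial assertion) pair
    extracted from the narrative is feasible under the GTFS data."""
    return int(all(feasible(t, s)
                   for t in temporal_constraints
                   for s in spatial_assertions))

# Hypothetical toy encoding: times are minutes since midnight; an assertion
# is feasible only if the narrative leaves enough time for the leg.
def toy_feasible(t, s):
    return t["arrive"] - t["depart"] >= s["min_travel"]

temporal = [{"depart": 540, "arrive": 565}]
spatial = [{"from": "Stop A", "to": "Stop B", "min_travel": 20}]
print(ncc(temporal, spatial, toy_feasible))  # 1
```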
      <p>Constructive Error Correction. The Minimal Edit Plausibility framework (MEP) assesses
constructive problem-solving through systematic itinerary repair. For an invalid journey J, we seek the
optimal correction δ* that minimizes edit distance while preserving user intent:</p>
      <p>δ* = arg min_δ λ_1 ‖δ‖_1 + λ_2 sem(J, δ(J)) + λ_3 user(δ)   (3)</p>
      <p>where ‖δ‖_1 represents edit magnitude, sem measures semantic preservation, and user quantifies user
impact. This framework reveals how models balance constraint satisfaction with solution quality,
directly probing constructive reasoning capabilities.</p>
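      <p>The weighted objective of Equation 3 can be sketched as a search over candidate repairs; the cost functions, weights, and candidate encoding below are hypothetical placeholders rather than our benchmark's scoring functions:</p>

```python
def minimal_edit(journey, candidates, edit_cost, sem_cost, user_cost,
                 lams=(1.0, 0.5, 0.25)):
    """Return the candidate repair minimizing
    lam1*edit magnitude + lam2*semantic drift + lam3*user impact."""
    l1, l2, l3 = lams
    return min(candidates,
               key=lambda d: l1 * edit_cost(d)
                             + l2 * sem_cost(journey, d)
                             + l3 * user_cost(d))

# Hypothetical candidate repairs with pre-computed component costs.
journey = ["Stop A 09:00", "Stop B 08:50"]  # invalid: arrival precedes departure
candidates = [
    {"name": "shift one time", "edits": 1, "drift": 0.2, "impact": 0.1},
    {"name": "reroute entirely", "edits": 3, "drift": 0.0, "impact": 0.0},
]
best = minimal_edit(journey, candidates,
                    edit_cost=lambda d: d["edits"],
                    sem_cost=lambda j, d: d["drift"],
                    user_cost=lambda d: d["impact"])
print(best["name"])  # shift one time
```

With these weights the small, intent-preserving fix beats the wholesale reroute, matching the framework's preference for minimal edits.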
      <p>Cross-Modal Spatial Reasoning. The Cross-Modal Anchoring framework (CMA) evaluates the
integration of spatial and textual information through consistency analysis over markdown-based
transit map representations. For transit map I and query q, we measure spatial understanding alignment:</p>
      <p>CMA(M, q, I) = sim(spatial(I), spatial(M(q)))   (4)</p>
      <p>where spatial(·) extracts topological relationships. This framework probes how models integrate spatial
and textual information streams, essential for real-world transit applications involving map
interpretation.</p>
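      <p>One way to instantiate sim over topological relations is a set-overlap measure; the sketch below assumes a simple "Route: Stop -&gt; Stop" markdown map format, and both that format and the Jaccard choice are illustrative assumptions:</p>

```python
def spatial_relations(markdown_map):
    """Extract (route, serves, stop) and (stop, precedes, stop) relations from
    lines of the assumed form 'Route X: Stop A -> Stop B -> ...'."""
    rels = set()
    for line in markdown_map.strip().splitlines():
        route, _, stops = line.partition(":")
        seq = [s.strip() for s in stops.split("->")]
        rels.update((route.strip(), "serves", s) for s in seq)
        rels.update((a, "precedes", b) for a, b in zip(seq, seq[1:]))
    return rels

def cma(map_relations, answer_relations):
    """Jaccard similarity between the two topological relation sets."""
    union = map_relations.union(answer_relations)
    if not union:
        return 1.0
    return len(map_relations.intersection(answer_relations)) / len(union)

reference = spatial_relations("Route 5: Stop A -> Stop B -> Stop C")
answer = spatial_relations("Route 5: Stop A -> Stop C")  # model dropped Stop B
print(round(cma(reference, answer), 3))  # 0.333
```

Dropping a single stop perturbs both the serves and precedes relations, so the score falls well below 1 even for a superficially plausible answer.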
      <p>Framework Integration for System Optimization. Our multi-dimensional approach enables
targeted post-training optimization. Models exhibiting low β values in PC analysis benefit from
multi-step reasoning augmentation in supervised fine-tuning. Strong NCC performance combined with
weak MEP scores suggests potential for reinforcement learning optimization targeting constructive
problem-solving. Framework correlations reveal architectural strengths: high PC-MEP correlation
indicates shared constructive reasoning mechanisms, while NCC-CMA alignment suggests multimodal
integration capabilities.</p>
      <p>The theoretical foundation extends to system architecture analysis. Retrieval-augmented models
typically demonstrate strong NCC performance due to comprehensive knowledge base access but
exhibit brittleness in PC scenarios requiring novel reasoning. Tool-calling architectures show variable
PC performance depending on tool chain complexity, while potentially excelling in MEP tasks when
appropriate repair tools are available.</p>
      <p>Furthermore, our framework provides theoretical justification for the strategic deployment of smaller
language models in transit evaluation contexts. Recent work demonstrates that specialized smaller
models often outperform general-purpose large models in constrained domains due to focused parameter
utilization and reduced interference from irrelevant capabilities [14], a consideration that is especially
relevant for safety- and time-critical transit applications.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>We evaluate four open-source language models, namely Gemma-7B, Mistral-7B, Llama3-7B, and Phi-7B,
selected for their demonstrated effectiveness in safety-critical transportation applications, particularly
their superior fine-tuning capabilities and performance in the tool-calling and retrieval-augmented
generation tasks essential for real-world transit deployment. Our evaluation employs GTFS datasets
from the San Francisco Municipal Transportation Agency, the Massachusetts Bay Transportation
Authority, and the Chicago Transit Authority, constructing a challenging benchmark with 500 samples
each for the PC and NCC tasks, and 300 samples each for the MEP and CMA tasks. All input samples
are generated systematically from the trips, routes, and stops in the GTFS data; the text samples for
NCC and MEP are constructed with accurate assertions and false counterfactuals, and for the CMA
task specifically, corpus samples are structured as a markdown spatial map built from the (S, R, T)
GTFS data.</p>
      <p>Our evaluation metrics directly correspond to the mathematical frameworks established in Section 2.
For Perturbation Chains (PC), we measure sequential accuracy at increasing complexity (S2, S3, S5)
alongside Counterfactual Coherence and Skip2 Consistency to assess reasoning robustness as formalized
in Equation 1. Narrative Coherence Checks (NCC) employ standard accuracy metrics complemented by
Balanced Accuracy, YES Recall over binary Yes/No (confirmation/negation) responses, and the YES
Bias Gap to capture the feasibility verification capabilities defined in Equation 2. Minimal Edit
Plausibility (MEP) introduces over-repair and under-repair rates that empirically measure the
edit-optimization control central to Equation 3, revealing systematic temporal reasoning failures.
Cross-Modal Anchoring (CMA) utilizes exact-match accuracy, positional error, and entity (stop/route)
flip rates to quantify the spatial consistency formalized in Equation 4.</p>
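      <p>One plausible operationalization of the over-/under-repair rates compares each model repair's edit count against a minimal gold repair; this reading, and the counting scheme below, are assumptions for illustration rather than our exact scoring procedure:</p>

```python
def repair_rates(predicted_edit_counts, gold_edit_counts):
    """Fraction of MEP cases where the model edits more (over-repair) or
    fewer (under-repair) itinerary fields than the minimal gold repair."""
    n = len(gold_edit_counts)
    over = sum(p > g for p, g in zip(predicted_edit_counts, gold_edit_counts)) / n
    under = sum(g > p for p, g in zip(predicted_edit_counts, gold_edit_counts)) / n
    return over, under

over, under = repair_rates([2, 1, 0, 3], [1, 1, 1, 1])
print(over, under)  # 0.5 0.25
```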
      <p>The experimental results expose fundamental limitations in current model capabilities across all
reasoning dimensions, demonstrating the challenging nature of our benchmark. In Cross-Modal
Anchoring, even the best-performing model (Mistral) achieves only 49% exact spatial matching accuracy,
while Phi exhibits severe spatial disorientation, with 21.3% Stop-Route flip errors and substantial
positional deviation (1.737 average error) revealing critical weaknesses.</p>
      <p>Minimal Edit Plausibility results demonstrate systematic temporal reasoning failures across all models,
with over-repair and under-repair rates clustered around 50% each, indicating near-random performance
in optimizing itinerary corrections.</p>
      <p>Narrative Coherence assessment reveals a striking pattern of systematic bias toward positive
classifications, with all models exhibiting near-perfect YES Recall (96.9-99.3%) but correspondingly poor
overall accuracy (46-48.5%). The YES Bias Gap metrics (0.486-0.511) quantify this overconfidence in
declaring invalid journeys as feasible, representing a critical safety concern for deployment scenarios
where false positives could mislead passengers into impossible travel plans.</p>
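      <p>The YES bias pattern can be quantified as the excess of affirmative predictions over truly feasible cases; this is one plausible reading of the YES Bias Gap metric, and the exact definition behind our reported numbers may differ:</p>

```python
def yes_bias_gap(predictions, labels):
    """Gap between the model's 'yes' (feasible) rate and the true feasible rate.
    predictions: 'yes'/'no' strings; labels: 1 for feasible, 0 otherwise."""
    yes_rate = sum(p == "yes" for p in predictions) / len(predictions)
    true_rate = sum(labels) / len(labels)
    return yes_rate - true_rate

preds = ["yes", "yes", "yes", "no"]
labels = [1, 0, 0, 0]
print(yes_bias_gap(preds, labels))  # 0.5
```

A gap near 0.5, as observed across all four models, means roughly half of all infeasible journeys are being declared feasible.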
      <p>Perturbation Chain analysis demonstrates the most dramatic capability degradation, validating our
theoretical framework’s prediction of reasoning brittleness under cascading complexity. While models
maintain reasonable performance at S2 (75-86% accuracy), performance deteriorates substantially by S3
(46.7-80%), with Phi showing catastrophic failure. Counterfactual Coherence (CF) scores uniformly below
6.2% across all models indicate severe limitations in maintaining logical consistency under hypothetical
scenarios, while Skip2 Consistency results (32.1-56%) reveal fundamental failures in multi-step reasoning
chains that our mathematical framework precisely captures.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Analysis &amp; Implications</title>
      <p>Our theoretical and empirical analysis establishes several key insights with direct implications for
transit system deployment. The exponential decay characterization of reasoning consistency provides a
principled foundation for system reliability assessment. Models with β &gt; 0.75 demonstrate sufficient
robustness for deployment scenarios involving up to three cascade steps, while those with β &lt; 0.65
require architectural improvements or operational constraints limiting query complexity.</p>
      <p>Framework profiles enable targeted optimization strategies. Models exhibiting strong NCC
performance but weak PC consistency benefit from multi-step reasoning augmentation in training data.
Systems showing high MEP capability combined with poor CMA scores suggest potential for
multimodal training enhancement. This systematic approach transforms post-training optimization from
ad-hoc experimentation to principled engineering.</p>
      <p>The architectural insights derived from our analysis provide concrete guidance for system design
decisions. Applications requiring robust cascade reasoning should prioritize models with high β values
regardless of baseline accuracy. Systems emphasizing error recovery should target MEP optimization
through constructive training approaches. This framework-driven architecture selection enables
principled resource selection in deployment scenarios.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work establishes a comprehensive theoretical framework for evaluating reasoning capabilities
in transit-domain Large Language Models that fundamentally transcends traditional accuracy-based
assessment. Our four-dimensional evaluation approach—Perturbation Chains, Narrative Coherence
Checks, Minimal Edit Plausibility, and Cross-Modal Anchoring—provides systematic methodology for
probing in-context learning, multi-step reasoning, and adaptive problem-solving capabilities essential
for real-world deployment. Beyond measurement, this framework enables strategic deployment of
specialized smaller models in safety-critical applications, provides theoretical justification for architecture
selection based on reasoning requirements, and establishes evaluation methodologies that align with
operational deployment constraints.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT and Grammarly in order to:
check grammar and spelling, and paraphrase and reword. After using these tools/services, the author(s)
reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[2] J. George, R. Green, P. Han, C. Tao, G. Clark, C. You, A. Abdolmaleki, J. Fu, T. Chen,
A. Chaugule, A. Chandorkar, A. Rahman, W. Thompson, P. Koanantakool, M. Bernico, J. Ren,
A. Vlasov, S. Vassilvitskii, M. Kula, Y. Liang, D. Kim, Y. Huang, C. Ye, D. Lepikhin, W. Helmholz,
Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next
generation agentic capabilities, 2025. URL: https://arxiv.org/abs/2507.06261. arXiv:2507.06261.</p>
      <p>[3] R. Zhang, D. Jiang, Y. Zhang, H. Lin, Z. Guo, P. Qiu, A. Zhou, P. Lu, K.-W. Chang, Y. Qiao,
P. Gao, H. Li, Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?,
in: A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, G. Varol (Eds.), Computer Vision –
ECCV 2024, Springer Nature Switzerland, Cham, 2025, pp. 169–186.</p>
      <p>[4] A. Didolkar, A. Goyal, N. R. Ke, S. Guo, M. Valko, T. Lillicrap, D. Jimenez Rezende, Y. Bengio,
M. C. Mozer, S. Arora, Metacognitive capabilities of llms: An exploration in mathematical problem
solving, Advances in Neural Information Processing Systems 37 (2024) 19783–19812.</p>
      <p>[5] F. Mu, L. Shi, S. Wang, Z. Yu, B. Zhang, C. Wang, S. Liu, Q. Wang, Clarifygpt: A framework for
enhancing llm-based code generation via requirements clarification, Proceedings of the ACM on
Software Engineering 1 (2024) 2332–2354.</p>
      <p>[6] S. Fakhoury, A. Naik, G. Sakkas, S. Chakraborty, S. K. Lahiri, Llm-based test-driven interactive
code generation: User study and empirical evaluation, IEEE Transactions on Software Engineering
(2024).</p>
      <p>[7] M. Kwon, H. Hu, V. Myers, S. Karamcheti, A. Dragan, D. Sadigh, Toward grounded commonsense
reasoning, in: 2024 IEEE International Conference on Robotics and Automation (ICRA), IEEE,
2024, pp. 5463–5470.</p>
      <p>[8] S. Krause, F. Stolzenburg, Commonsense reasoning and explainable artificial intelligence using
large language models, in: European Conference on Artificial Intelligence, Springer, 2023,
pp. 302–319.</p>
      <p>[9] A. Toroghi, W. Guo, A. Pesaranghader, S. Sanner, Verifiable, debuggable, and repairable
commonsense logical reasoning via llm-based theory resolution, in: Proceedings of the 2024
Conference on Empirical Methods in Natural Language Processing, 2024, pp. 6634–6652.</p>
      <p>[10] T. Zheng, C. Jiayang, C. Li, H. Shi, Z. Wang, J. Bai, Y. Song, G. Wong, S. See, Logidynamics:
Unraveling the dynamics of inductive, abductive and deductive logical inferences in llm reasoning,
in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,
2025, pp. 20721–20742.</p>
      <p>[11] Z. Di, C. Zhang, H. Lv, L. Cui, L. Liu, Lorp: Llm-based logical reasoning via prolog,
Knowledge-Based Systems (2025) 114140.</p>
      <p>[12] S. Devunuri, L. J. Lehe, Transitgpt: A generative ai-based framework for interacting with
gtfs data using large language models, arXiv preprint arXiv:2412.06831 (2024).</p>
      <p>[13] S. Devunuri, S. Qiam, L. J. Lehe, Chatgpt for gtfs: Benchmarking llms on gtfs understanding
and retrieval, arXiv preprint arXiv:2308.02618 (2024).</p>
      <p>[14] P. Belcak, G. Heinrich, S. Diao, Y. Fu, X. Dong, S. Muralidharan, Y. C. Lin, P. Molchanov,
Small language models are the future of agentic ai, arXiv preprint arXiv:2506.02153 (2025).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] B. Workshop: T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow,
R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, J. Tow, A. M. Rush, S. Biderman, A. Webson,
P. S. Ammanamanchi, T. Wang, B. Sagot, N. Muennighof, A. V. del Moral, O. Ruwase, R. Bawden,
S. Bekman, A. McMillan-Major, I. Beltagy, H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh,
H. Laurençon, Y. Jernite, J. Launay, M. Mitchell, C. Rafel, A. Gokaslan, A. Simhi, A. Soroa,
A. F. Aji, A. Alfassy, A. Rogers, A. K. Nitzav, C. Xu, C. Mou, C. Emezue, C. Klamm, C. Leong,
D. van Strien, D. I. Adelani, D. Radev, E. G. Ponferrada, E. Levkovizh, E. Kim, E. B. Natan,
F. D. Toni, G. Dupont, G. Kruszewski, G. Pistilli, H. Elsahar, H. Benyamina, H. Tran, I. Yu,
I. Abdulmumin, I. Johnson, I. Gonzalez-Dios, J. de la Rosa, J. Chim, J. Dodge, J. Zhu, J. Chang,
J. Frohberg, J. Tobing, J. Bhattacharjee, K. Almubarak, K. Chen, K. Lo, L. V. Werra, L. Weber,
L. Phan, L. B. Allal, L. Tanguy, M. Dey, M. R. Muñoz, M. Masoud, M. Grandury, M. Šaško,
M. Huang, M. Coavoux, M. Singh, M. T.-J. Jiang, M. C. Vu, M. A. Jauhar, M. Ghaleb, N. Subramani,
N. Kassner, N. Khamis, O. Nguyen, O. Espejel, O. de Gibert, P. Villegas, P. Henderson, P. Colombo,
P. Amuok, Q. Lhoest, R. Harliman, R. Bommasani, R. L. López,</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>