<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Measuring What Matters: Probing Transit Reasoning Consistency in Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hariram Veeramani</string-name>
          <email>hariram@ucla.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Surendrabikram Thapa</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Usman Naseem</string-name>
          <email>usman.naseem@mq.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing, Macquarie University</institution>
          ,
          <addr-line>Sydney, NSW, 2113</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California Los Angeles</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Virginia Tech</institution>
          ,
          <addr-line>Blacksburg, Virginia, 24060</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a micro-benchmark and a comprehensive evaluation framework for transit-domain Large Language Models that transcend traditional accuracy metrics by probing in-context learning capabilities and multi-step reasoning processes. Our approach introduces four complementary evaluation paradigms, namely Perturbation Chains, Narrative Coherence Checks, Minimal Edit Plausibility, and Cross-Modal Anchoring, that collectively assess how models adapt, reason, and maintain consistency under domain-specific constraints. Through systematic evaluation of four state-of-the-art models, we demonstrate substantial performance disparities in cascading reasoning scenarios despite similar baseline accuracy, revealing fundamental limitations in current evaluation methodologies. Our framework, together with the benchmark, provides actionable insights for post-training optimization strategies, enables principled comparison of retrieval-augmented versus tool-calling architectures, and establishes theoretical foundations for deploying specialized smaller models in safety-critical transit applications. The benchmark and evaluation suite will be shared with the community along with further extended studies.</p>
      </abstract>
      <kwd-group>
        <kwd>Composite Reasoning</kwd>
        <kwd>Multi-step Reasoning</kwd>
        <kwd>KG Reasoning</kwd>
        <kwd>Agentic Systems</kwd>
        <kwd>LLM Consistency</kwd>
        <kwd>LLM Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models have demonstrated remarkable capabilities across diverse reasoning tasks [
        <xref ref-type="bibr" rid="ref1">1, 2</xref>
        ],
from mathematical problem-solving [3, 4] and code generation [5, 6] to commonsense reasoning [7, 8, 9]
and logical inference [10, 11]. This success has motivated their deployment in increasingly complex
real-world applications, including safety-critical domains such as public transit systems. Recent studies
report that LLMs achieve accuracy rates exceeding 90% on General Transit Feed Specification (GTFS)
tasks [12, 13], suggesting readiness for production deployment. However, these metrics fundamentally
measure task completion rather than the underlying reasoning capabilities essential for real-world
reliability. When passengers pose complex queries such as "Given current service disruptions, what
alternative routes minimize both travel time and transfers while avoiding construction zones?", the
system must demonstrate sophisticated in-context learning, multi-step reasoning, and adaptive
problem-solving capabilities that traditional accuracy metrics cannot capture.
      </p>
      <p>This discrepancy between measured performance and required reasoning capabilities represents
a critical gap in current evaluation methodologies. Transit systems operate under strict safety and
reliability constraints where reasoning failures can cascade into significant user impact. A system that
achieves high accuracy on isolated queries but fails to maintain logical consistency under perturbations
poses substantial deployment risks.</p>
      <p>Our work addresses this evaluation gap through three contributions built around four evaluation
frameworks. First, we formalize mathematical frameworks that probe distinct dimensions of reasoning
quality in transit-domain applications. Second, we demonstrate how these frameworks reveal fundamental
differences in in-context learning capabilities across model architectures. Third, we propose qualitative
connections between evaluation outcomes and post-training optimization strategies, including supervised
fine-tuning and reinforcement learning, with a focus on relatively smaller language models in
domain-specific evaluation contexts, drawing on recent advances in agentic AI systems [14].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Multi-Dimensional Transit Reasoning Framework</title>
      <p>Let D = (S, R, T) represent a GTFS dataset, where S denotes stops, R represents routes, and T
encompasses scheduled trips. Traditional evaluation computes binary accuracy as Acc(M, Q) =
|Q|^{-1} ∑_{i=1}^{|Q|} 1[M(q_i) = a_i] for model M, query set Q, and ground-truth responses {a_i}.
While computationally efficient, this formulation provides no insight into reasoning processes, failure
propagation mechanisms, or in-context adaptation capabilities.</p>
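      <p>As a reference point, the baseline accuracy computation above can be sketched as follows (the toy model and queries are hypothetical illustrations, not drawn from our benchmark):</p>

```python
def binary_accuracy(model, queries, ground_truth):
    """Traditional evaluation: fraction of queries whose answer exactly
    matches the ground truth, i.e. |Q|^{-1} times the count of 1[M(q_i) = a_i]."""
    correct = sum(1 for q, a in zip(queries, ground_truth) if model(q) == a)
    return correct / len(queries)

# Hypothetical stand-in for a transit QA model.
toy_model = {"Which route serves Stop A?": "Route 5",
             "When is the last trip on Route 5?": "23:40"}.get
queries = ["Which route serves Stop A?", "When is the last trip on Route 5?"]
answers = ["Route 5", "23:10"]
print(binary_accuracy(toy_model, queries, answers))  # 0.5
```

The score treats a near-miss and a catastrophic failure identically, which is precisely the insensitivity the frameworks below address.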
      <p>We propose a comprehensive evaluation framework Φ = {PC, NCC, MEP, CMA} designed to
probe fundamental reasoning dimensions that emerge in transit-domain applications.</p>
      <p>Perturbation Chain Analysis. The Perturbation Chain framework (PC) probes in-context learning
robustness through systematic cascade testing. For base query q_0 and perturbation sequence
{δ_i}_{i=1}^{k}, we construct modified queries q_i = δ_i(q_{i-1}) that incrementally alter system state.
The reasoning consistency score quantifies degradation patterns:</p>
      <p>RCS(M, q_0) = ∏_{i=1}^{k} P[valid(M(q_i)) | valid(M(q_{i-1}))]   (1)</p>
      <p>where valid(·) indicates logical consistency with the perturbed GTFS state. This formulation captures
how effectively models maintain coherent reasoning as problem complexity increases, directly probing
in-context adaptation mechanisms.</p>
      <p>We hypothesize that reasoning degradation follows exponential decay, RCS(M, q_0) ≈ α · β^k,
where parameter α characterizes initial reasoning quality and β &lt; 1 quantifies robustness to cascading
complexity. Models with superior in-context learning should exhibit higher β values, indicating better
preservation of logical consistency under sequential perturbations.</p>
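      <p>Under these definitions, RCS can be estimated per chain depth from a collection of perturbation chains, and the decay parameters α, β recovered with a log-linear fit; a minimal sketch (the estimator design is our own illustration):</p>

```python
import math

def rcs_curve(chains):
    """Estimate RCS(k) = prod_{i<=k} P[valid_i | valid_{i-1}] from many
    perturbation chains, each a list of 0/1 validity flags per step."""
    depth = len(chains[0])
    probs = []
    for i in range(depth):
        if i == 0:
            cond = [c[0] for c in chains]          # unconditional first step
        else:
            cond = [c[i] for c in chains if c[i - 1]]  # condition on prior validity
        probs.append(sum(cond) / max(len(cond), 1))
    curve, acc = [], 1.0
    for p in probs:               # cumulative product gives RCS at each depth
        acc *= p
        curve.append(acc)
    return curve

def fit_decay(curve):
    """Fit RCS(k) ~ alpha * beta**k by least squares in log space."""
    ks = [k + 1 for k in range(len(curve))]
    logs = [math.log(max(r, 1e-9)) for r in curve]
    n = len(ks)
    kbar, lbar = sum(ks) / n, sum(logs) / n
    slope = (sum((k - kbar) * (l - lbar) for k, l in zip(ks, logs))
             / sum((k - kbar) ** 2 for k in ks))
    return math.exp(lbar - slope * kbar), math.exp(slope)  # alpha, beta
```

For a perfectly geometric curve such as [0.8, 0.64, 0.512], the fit recovers α ≈ 1.0 and β ≈ 0.8.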
      <p>Narrative Coherence Assessment. The Narrative Coherence Check framework (NCC) evaluates
temporal-spatial reasoning through natural language journey analysis. Given a narrative N containing
transit descriptions, we extract temporal constraints T(N) and spatial assertions S(N), then verify
feasibility:</p>
      <p>NCC(N, D) = 1[ ⋀_{(t,s) ∈ T(N) × S(N)} feasible(t, s, D) ]   (2)</p>
      <p>This framework probes how models integrate multiple information streams and detect logical
inconsistencies in complex scenarios, providing insights into the compositional reasoning capabilities
essential for transit assistance.</p>
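      <p>The feasibility conjunction of Equation 2 can be sketched directly; the constraint encoding and the toy feasibility check below are hypothetical illustrations, not the extractors used in our benchmark:</p>

```python
def ncc(temporal_constraints, spatial_assertions, feasible):
    """NCC = 1 iff every (temporal constraint, spatial assertion) pair
    extracted from the narrative is feasible under the GTFS data."""
    return int(all(feasible(t, s)
                   for t in temporal_constraints
                   for s in spatial_assertions))

# Hypothetical toy encoding: times are minutes since midnight; an assertion
# is feasible only if the narrative leaves enough time for the leg.
def toy_feasible(t, s):
    return t["arrive"] - t["depart"] >= s["min_travel"]

temporal = [{"depart": 540, "arrive": 565}]
spatial = [{"from": "Stop A", "to": "Stop B", "min_travel": 20}]
print(ncc(temporal, spatial, toy_feasible))  # 1
```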
      <p>Constructive Error Correction. The Minimal Edit Plausibility framework (MEP) assesses
constructive problem-solving through systematic itinerary repair. For an invalid journey J, we seek the
optimal correction δ* that minimizes edit distance while preserving user intent:</p>
      <p>δ* = arg min_δ λ_1 ‖δ‖_1 + λ_2 sem(J, δ(J)) + λ_3 user(δ)   (3)</p>
      <p>where ‖δ‖_1 represents edit magnitude, sem measures semantic preservation, and user quantifies user
impact. This framework reveals how models balance constraint satisfaction with solution quality,
directly probing constructive reasoning capabilities.</p>
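      <p>The weighted objective of Equation 3 can be sketched as a search over candidate repairs; the cost functions, weights, and candidate encoding below are hypothetical placeholders rather than our benchmark's scoring functions:</p>

```python
def minimal_edit(journey, candidates, edit_cost, sem_cost, user_cost,
                 lams=(1.0, 0.5, 0.25)):
    """Return the candidate repair minimizing
    lam1*edit magnitude + lam2*semantic drift + lam3*user impact."""
    l1, l2, l3 = lams
    return min(candidates,
               key=lambda d: l1 * edit_cost(d)
                             + l2 * sem_cost(journey, d)
                             + l3 * user_cost(d))

# Hypothetical candidate repairs with pre-computed component costs.
journey = ["Stop A 09:00", "Stop B 08:50"]  # invalid: arrival precedes departure
candidates = [
    {"name": "shift one time", "edits": 1, "drift": 0.2, "impact": 0.1},
    {"name": "reroute entirely", "edits": 3, "drift": 0.0, "impact": 0.0},
]
best = minimal_edit(journey, candidates,
                    edit_cost=lambda d: d["edits"],
                    sem_cost=lambda j, d: d["drift"],
                    user_cost=lambda d: d["impact"])
print(best["name"])  # shift one time
```

With these weights the small, intent-preserving fix beats the wholesale reroute, matching the framework's preference for minimal edits.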
      <p>Cross-Modal Spatial Reasoning. The Cross-Modal Anchoring framework (CMA) evaluates the
integration of spatial and textual information through consistency analysis over markdown-based
transit map representations. For transit map I and query q, we measure spatial understanding alignment:</p>
      <p>CMA(M, q, I) = sim(spatial(I), spatial(M(q)))   (4)</p>
      <p>where spatial(·) extracts topological relationships. This framework probes how models integrate spatial
and textual information streams, essential for real-world transit applications involving map
interpretation.</p>
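      <p>One way to instantiate sim over topological relations is a set-overlap measure; the sketch below assumes a simple "Route: Stop -&gt; Stop" markdown map format, and both that format and the Jaccard choice are illustrative assumptions:</p>

```python
def spatial_relations(markdown_map):
    """Extract (route, serves, stop) and (stop, precedes, stop) relations from
    lines of the assumed form 'Route X: Stop A -> Stop B -> ...'."""
    rels = set()
    for line in markdown_map.strip().splitlines():
        route, _, stops = line.partition(":")
        seq = [s.strip() for s in stops.split("->")]
        rels.update((route.strip(), "serves", s) for s in seq)
        rels.update((a, "precedes", b) for a, b in zip(seq, seq[1:]))
    return rels

def cma(map_relations, answer_relations):
    """Jaccard similarity between the two topological relation sets."""
    union = map_relations.union(answer_relations)
    if not union:
        return 1.0
    return len(map_relations.intersection(answer_relations)) / len(union)

reference = spatial_relations("Route 5: Stop A -> Stop B -> Stop C")
answer = spatial_relations("Route 5: Stop A -> Stop C")  # model dropped Stop B
print(round(cma(reference, answer), 3))  # 0.333
```

Dropping a single stop perturbs both the serves and precedes relations, so the score falls well below 1 even for a superficially plausible answer.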
      <p>Framework Integration for System Optimization. Our multi-dimensional approach enables
targeted post-training optimization. Models exhibiting low β values in PC analysis benefit from
multi-step reasoning augmentation in supervised fine-tuning. Strong NCC performance combined with
weak MEP scores suggests potential for reinforcement learning optimization targeting constructive
problem-solving. Framework correlations reveal architectural strengths: high PC-MEP correlation
indicates shared constructive reasoning mechanisms, while NCC-CMA alignment suggests multimodal
integration capabilities.</p>
      <p>The theoretical foundation extends to system architecture analysis. Retrieval-augmented models
typically demonstrate strong NCC performance due to comprehensive knowledge base access but
exhibit brittleness in PC scenarios requiring novel reasoning. Tool-calling architectures show variable
PC performance depending on tool chain complexity, while potentially excelling in MEP tasks when
appropriate repair tools are available.</p>
      <p>Furthermore, our framework provides theoretical justification for the strategic deployment of smaller
language models in transit evaluation contexts. Recent work demonstrates that specialized smaller
models often outperform general-purpose large models in constrained domains due to focused parameter
utilization and reduced interference from irrelevant capabilities [14], a consideration that is especially
relevant for safety- and time-critical transit applications.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>We evaluate four open-source language models, namely Gemma-7B, Mistral-7B, Llama3-7B, and Phi-7B,
selected for their demonstrated effectiveness in safety-critical transportation applications, particularly
their superior fine-tuning capabilities and performance in the tool-calling and retrieval-augmented
generation tasks essential for real-world transit deployment. Our evaluation employs GTFS datasets
from the San Francisco Municipal Transportation Agency, the Massachusetts Bay Transportation
Authority, and the Chicago Transit Authority, constructing a challenging benchmark with 500 samples
each for the PC and NCC tasks, and 300 samples each for the MEP and CMA tasks. All input samples
are generated systematically from the trips, routes, and stops in the GTFS data; the text samples for
NCC and MEP are constructed with accurate assertions and false counterfactuals, and for the CMA
task specifically, corpus samples are structured as a markdown spatial map built from the (S, R, T)
GTFS data.</p>
      <p>Our evaluation metrics directly correspond to the mathematical frameworks established in Section 2.
For Perturbation Chains (PC), we measure sequential accuracy at increasing complexity (S2, S3, S5)
alongside Counterfactual Coherence and Skip2 Consistency to assess reasoning robustness as formalized
in Equation 1. Narrative Coherence Checks (NCC) employ standard accuracy metrics complemented by
Balanced Accuracy, YES Recall over binary Yes/No (confirmation/negation) responses, and the YES
Bias Gap to capture the feasibility verification capabilities defined in Equation 2. Minimal Edit
Plausibility (MEP) introduces over-repair and under-repair rates that empirically measure the
edit-optimization control central to Equation 3, revealing systematic temporal reasoning failures.
Cross-Modal Anchoring (CMA) utilizes exact-match accuracy, positional error, and entity (stop/route)
flip rates to quantify the spatial consistency formalized in Equation 4.</p>
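      <p>One plausible operationalization of the over-/under-repair rates compares each model repair's edit count against a minimal gold repair; this reading, and the counting scheme below, are assumptions for illustration rather than our exact scoring procedure:</p>

```python
def repair_rates(predicted_edit_counts, gold_edit_counts):
    """Fraction of MEP cases where the model edits more (over-repair) or
    fewer (under-repair) itinerary fields than the minimal gold repair."""
    n = len(gold_edit_counts)
    over = sum(p > g for p, g in zip(predicted_edit_counts, gold_edit_counts)) / n
    under = sum(g > p for p, g in zip(predicted_edit_counts, gold_edit_counts)) / n
    return over, under

over, under = repair_rates([2, 1, 0, 3], [1, 1, 1, 1])
print(over, under)  # 0.5 0.25
```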
      <p>The experimental results expose fundamental limitations in current model capabilities across all
reasoning dimensions, demonstrating the challenging nature of our benchmark. In Cross-Modal
Anchoring, even the best-performing model (Mistral) achieves only 49% exact spatial matching accuracy,
while Phi exhibits severe spatial disorientation, with 21.3% Stop-Route flip errors and substantial
positional deviation (1.737 average error) revealing critical weaknesses.</p>
      <p>Minimal Edit Plausibility results demonstrate systematic temporal reasoning failures across all models,
with over-repair and under-repair rates clustered around 50% each, indicating near-random performance
in optimizing itinerary corrections.</p>
      <p>Narrative Coherence assessment reveals a striking pattern of systematic bias toward positive
classifications, with all models exhibiting near-perfect YES Recall (96.9-99.3%) but correspondingly poor
overall accuracy (46-48.5%). The YES Bias Gap metrics (0.486-0.511) quantify this overconfidence in
declaring invalid journeys as feasible, representing a critical safety concern for deployment scenarios
where false positives could mislead passengers into impossible travel plans.</p>
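      <p>The YES bias pattern can be quantified as the excess of affirmative predictions over truly feasible cases; this is one plausible reading of the YES Bias Gap metric, and the exact definition behind our reported numbers may differ:</p>

```python
def yes_bias_gap(predictions, labels):
    """Gap between the model's 'yes' (feasible) rate and the true feasible rate.
    predictions: 'yes'/'no' strings; labels: 1 for feasible, 0 otherwise."""
    yes_rate = sum(p == "yes" for p in predictions) / len(predictions)
    true_rate = sum(labels) / len(labels)
    return yes_rate - true_rate

preds = ["yes", "yes", "yes", "no"]
labels = [1, 0, 0, 0]
print(yes_bias_gap(preds, labels))  # 0.5
```

A gap near 0.5, as observed across all four models, means roughly half of all infeasible journeys are being declared feasible.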
      <p>Perturbation Chain analysis demonstrates the most dramatic capability degradation, validating our
theoretical framework’s prediction of reasoning brittleness under cascading complexity. While models
maintain reasonable performance at S2 (75-86% accuracy), performance deteriorates substantially by S3
(46.7-80%), with Phi showing catastrophic failure. Counterfactual Coherence (CF) scores uniformly below
6.2% across all models indicate severe limitations in maintaining logical consistency under hypothetical
scenarios, while Skip2 Consistency results (32.1-56%) reveal fundamental failures in multi-step reasoning
chains that our mathematical framework precisely captures.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Analysis &amp; Implications</title>
      <p>Our theoretical and empirical analysis establishes several key insights with direct implications for
transit system deployment. The exponential decay characterization of reasoning consistency provides a
principled foundation for system reliability assessment. Models with β &gt; 0.75 demonstrate sufficient
robustness for deployment scenarios involving up to three cascade steps, while those with β &lt; 0.65
require architectural improvements or operational constraints limiting query complexity.</p>
      <p>Framework profiles enable targeted optimization strategies. Models exhibiting strong NCC
performance but weak PC consistency benefit from multi-step reasoning augmentation in training data.
Systems showing high MEP capability combined with poor CMA scores suggest potential for
multimodal training enhancement. This systematic approach transforms post-training optimization from
ad-hoc experimentation to principled engineering.</p>
      <p>The architectural insights derived from our analysis provide concrete guidance for system design
decisions. Applications requiring robust cascade reasoning should prioritize models with high β values
regardless of baseline accuracy. Systems emphasizing error recovery should target MEP optimization
through constructive training approaches. This framework-driven architecture selection enables
principled resource selection in deployment scenarios.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work establishes a comprehensive theoretical framework for evaluating reasoning capabilities
in transit-domain Large Language Models that fundamentally transcends traditional accuracy-based
assessment. Our four-dimensional evaluation approach—Perturbation Chains, Narrative Coherence
Checks, Minimal Edit Plausibility, and Cross-Modal Anchoring—provides systematic methodology for
probing in-context learning, multi-step reasoning, and adaptive problem-solving capabilities essential
for real-world deployment. Beyond measurement, this framework enables strategic deployment of
specialized smaller models in safety-critical applications, provides theoretical justification for architecture
selection based on reasoning requirements, and establishes evaluation methodologies that align with
operational deployment constraints.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT and Grammarly in order to:
check grammar and spelling, and paraphrase and reword. After using these tools/services, the author(s)
reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[2] J. George, R. Green, P. Han, C. Tao, G. Clark, C. You, A. Abdolmaleki, J. Fu, T. Chen,
A. Chaugule, A. Chandorkar, A. Rahman, W. Thompson, P. Koanantakool, M. Bernico, J. Ren,
A. Vlasov, S. Vassilvitskii, M. Kula, Y. Liang, D. Kim, Y. Huang, C. Ye, D. Lepikhin, W. Helmholz,
Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next
generation agentic capabilities, 2025. URL: https://arxiv.org/abs/2507.06261. arXiv:2507.06261.</p>
      <p>[3] R. Zhang, D. Jiang, Y. Zhang, H. Lin, Z. Guo, P. Qiu, A. Zhou, P. Lu, K.-W. Chang, Y. Qiao,
P. Gao, H. Li, Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?,
in: A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, G. Varol (Eds.), Computer Vision –
ECCV 2024, Springer Nature Switzerland, Cham, 2025, pp. 169–186.</p>
      <p>[4] A. Didolkar, A. Goyal, N. R. Ke, S. Guo, M. Valko, T. Lillicrap, D. Jimenez Rezende, Y. Bengio,
M. C. Mozer, S. Arora, Metacognitive capabilities of llms: An exploration in mathematical problem
solving, Advances in Neural Information Processing Systems 37 (2024) 19783–19812.</p>
      <p>[5] F. Mu, L. Shi, S. Wang, Z. Yu, B. Zhang, C. Wang, S. Liu, Q. Wang, Clarifygpt: A framework for
enhancing llm-based code generation via requirements clarification, Proceedings of the ACM on
Software Engineering 1 (2024) 2332–2354.</p>
      <p>[6] S. Fakhoury, A. Naik, G. Sakkas, S. Chakraborty, S. K. Lahiri, Llm-based test-driven interactive
code generation: User study and empirical evaluation, IEEE Transactions on Software Engineering
(2024).</p>
      <p>[7] M. Kwon, H. Hu, V. Myers, S. Karamcheti, A. Dragan, D. Sadigh, Toward grounded commonsense
reasoning, in: 2024 IEEE International Conference on Robotics and Automation (ICRA), IEEE,
2024, pp. 5463–5470.</p>
      <p>[8] S. Krause, F. Stolzenburg, Commonsense reasoning and explainable artificial intelligence using
large language models, in: European Conference on Artificial Intelligence, Springer, 2023,
pp. 302–319.</p>
      <p>[9] A. Toroghi, W. Guo, A. Pesaranghader, S. Sanner, Verifiable, debuggable, and repairable
commonsense logical reasoning via llm-based theory resolution, in: Proceedings of the 2024
Conference on Empirical Methods in Natural Language Processing, 2024, pp. 6634–6652.</p>
      <p>[10] T. Zheng, C. Jiayang, C. Li, H. Shi, Z. Wang, J. Bai, Y. Song, G. Wong, S. See, Logidynamics:
Unraveling the dynamics of inductive, abductive and deductive logical inferences in llm reasoning,
in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,
2025, pp. 20721–20742.</p>
      <p>[11] Z. Di, C. Zhang, H. Lv, L. Cui, L. Liu, Lorp: Llm-based logical reasoning via prolog,
Knowledge-Based Systems (2025) 114140.</p>
      <p>[12] S. Devunuri, L. J. Lehe, Transitgpt: A generative ai-based framework for interacting with
gtfs data using large language models, arXiv preprint arXiv:2412.06831 (2024).</p>
      <p>[13] S. Devunuri, S. Qiam, L. J. Lehe, Chatgpt for gtfs: Benchmarking llms on gtfs understanding
and retrieval, arXiv preprint arXiv:2308.02618 (2024).</p>
      <p>[14] P. Belcak, G. Heinrich, S. Diao, Y. Fu, X. Dong, S. Muralidharan, Y. C. Lin, P. Molchanov,
Small language models are the future of agentic ai, arXiv preprint arXiv:2506.02153 (2025).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] B. Workshop: T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow,
R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, J. Tow, A. M. Rush, S. Biderman, A. Webson,
P. S. Ammanamanchi, T. Wang, B. Sagot, N. Muennighof, A. V. del Moral, O. Ruwase, R. Bawden,
S. Bekman, A. McMillan-Major, I. Beltagy, H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh,
H. Laurençon, Y. Jernite, J. Launay, M. Mitchell, C. Rafel, A. Gokaslan, A. Simhi, A. Soroa,
A. F. Aji, A. Alfassy, A. Rogers, A. K. Nitzav, C. Xu, C. Mou, C. Emezue, C. Klamm, C. Leong,
D. van Strien, D. I. Adelani, D. Radev, E. G. Ponferrada, E. Levkovizh, E. Kim, E. B. Natan,
F. D. Toni, G. Dupont, G. Kruszewski, G. Pistilli, H. Elsahar, H. Benyamina, H. Tran, I. Yu,
I. Abdulmumin, I. Johnson, I. Gonzalez-Dios, J. de la Rosa, J. Chim, J. Dodge, J. Zhu, J. Chang,
J. Frohberg, J. Tobing, J. Bhattacharjee, K. Almubarak, K. Chen, K. Lo, L. V. Werra, L. Weber,
L. Phan, L. B. Allal, L. Tanguy, M. Dey, M. R. Muñoz, M. Masoud, M. Grandury, M. Šaško,
M. Huang, M. Coavoux, M. Singh, M. T.-J. Jiang, M. C. Vu, M. A. Jauhar, M. Ghaleb, N. Subramani,
N. Kassner, N. Khamis, O. Nguyen, O. Espejel, O. de Gibert, P. Villegas, P. Henderson, P. Colombo,
P. Amuok, Q. Lhoest, R. Harliman, R. Bommasani, R. L. López,</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>