<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>A Modular LLM-based Dialog System for Accessible Exploration of Finite State Automata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefano Vittorio Porta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pier Felice Balestrucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Oliverio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Anselma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Mazzei</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Turin</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In the field of assistive technologies, making complex visual content such as graphs or conceptual maps accessible to visually impaired users remains a significant challenge. This work proposes a modular dialog system that leverages a combination of neural Natural Language Understanding (NLU) and Retrieval-Augmented Generation (RAG) to translate graphical structures into meaningful text-based interactions. The NLU module combines a fine-tuned BERT classifier for intent recognition with a spaCy-based Named Entity Recognition (NER) model to extract user intents and parameters. Moreover, the RAG pipeline retrieves relevant subgraphs and contextual information from a knowledge base, reranking and summarizing them via a language model. We evaluate the system across multiple specific tasks, achieving over 92% F1 in intent classification and NER, and demonstrate that even open-weight models, like DeepSeek-r1 or LLaMA-3.1, can offer competitive performance compared to GPT-4o in specific domains. Our approach enhances accessibility while maintaining modularity, interpretability, and performance on par with modern LLM architectures.</p>
      </abstract>
      <kwd-group>
        <kwd>Dialogue Systems</kwd>
        <kwd>Retrieval-Augmented Generation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Education</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        tures combine LLMs with modular architectures, such to handle ambiguous or context-dependent queries,
esas Retrieval-Augmented Generation (RAG) systems [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], pecially in domains that involve structured or graphical
which integrate the text generation capabilities of LLMs information.
with an information retrieval module for selecting and To overcome these limitations, modern DSs
increaspresenting the most relevant information to the user. ingly adopt neural NLU methods. Intent classification is
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we introduced AIML+, a novel framework based commonly modeled as a supervised classification task,
on AIML, specifically developed for building DSs to as- where transformer-based models such as BERT have
sist visually impaired students in navigating graphical demonstrated state-of-the-art performance [14, 15].
structures. The use of AIML was motivated by the need Early NER systems relied on hand-crafted rules and
to provide accurate responses to users, although it also domain-specific features, which required significant
hurevealed limitations in terms of NLU. Building on this, man efort and expertise [ 16]. Recent advances
leverand with the goal of creating a reliable system suitable for age distributed representations, context encoders, and
critical domains such as education, this paper extends our tag decoders, achieving state-of-the-art results with less
previous work by integrating LLMs into rule-based DSs, manual feature engineering [17, 18]
resulting in a RAG pipeline. This work aims to improve In parallel, RAG has emerged as a prominent approach
the often brittle NLU of traditional rule-based approaches to enabling language models to ground their responses
and to reduce hallucinations in NLG. in external knowledge. Although initially developed for
      </p>
      <p>Specifically, our proposal employs a hybrid architec- open-domain QA and document-based tasks, its use in
ture that combines: (i) an NLU module based on intent structured or symbolic domains, such as graphs, is
gainclassifier and NER to interpret user utterances; (ii) a rule- ing attention, particularly in educational or assistive
setbased information retrieval module to extract relevant tings [19, 20]. However, these systems often focus on
information; and (iii) an LLM-based NLG module to gen- general factual retrieval and rarely address the
accessibilerate the system response. ity needs of users navigating inherently visual content.</p>
      <p>
        The paper is structured as follows. Section 2 reviews This work builds upon the NoVAGraphS project,
related work in the field of accessible technologies and which first proposed transforming non-visual access to
dialog systems. Section 3 presents the proposed method- graphical content into a dialog-based paradigm via
handology. Section 5 focuses on the performance of the NLU crafted AIML conversational systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We build on
pipeline, Section 6 explains the Dialog Manager and Re- this work by introducing a neural NLU pipeline and a
trieval Layer logic, while Section 7 evaluates the genera- RAG component specifically tailored to the retrieval and
tion module through both human and automatic assess- generation of descriptions from symbolic graph
strucments. We conclude with a discussion of our findings tures.
and future directions in Section 8.3
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. Methodology</title>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <sec id="sec-3-1">
        <title>We propose a modular dialog system based on</title>
        <p>
          Accessible technologies have explored various strategies transformer-based components used for both NLU and
to convey graphical information to VIP, including haptic NLG (see Figure 1). To build the NLU module, we
exfeedback (e.g., vibrations and touch cues) [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ], soni- tended an existing resource [21] by applying both
autoifcation (data-to-sound mappings) [ 10, 11], and textual matic data augmentation and manual annotation. In this
descriptions [12, 13]. While efective in specific contexts, way, we have been able to train models for both the tasks
these approaches often lack flexibility, interactivity, and of (1) Intent Classification and (2) Named-Entity
Recognigeneralizability—particularly when dealing with com- tion. The output of the NLU module is then passed to the
plex or symbolic visual content. To address these limita- dialog management module, a rule-based system
respontions, DSs have been proposed as a more dynamic and sible for retrieving the specific information requested by
user-adaptive interface for mediating access to graphical the user, referred to as retrieved evidence in this paper.
structures. The retrieved evidence originates from structured
knowl
        </p>
        <p>
          Early DSs often relied on hand-crafted rules to parse edge bases that, in the experimentation described below,
user input and generate responses. AIML [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], for instance, consists of a specific diagram. The NLG module employs
encodes pattern-response pairs via XML, enabling de- a prompt built by the Dialog Manager to generate from
terministic rule-based dialogs. Although accessible and LLMs natural and contextually relevant responses by
interpretable, these systems lack the robustness required leveraging both the current user intent and the retrieved
evidence.
        </p>
        <p>Given our task-based approach, we focus on dialogs
3All code and experimental results are publicly available at https: about Finite State Automata (FSA) as a specific case study.
//github.com/stefa168/tesi_tln.</p>
        <p>User Input
Is there a state called 's9' in
the automaton?
System Output
There is no node called "s9"
in the automaton.</p>
        <p>NEURAL-NLU
Intent Classifier
Named-Entity
Recognizer
NLG-LLM</p>
        <p>Intent + NE
Intent = state.existence
Entities = [(NODE, 's9')]
Prompt with
User Input +
Retrieved
Evidence</p>
        <p>Query
A.exists_node(s9)
Query Signature
node s9 existence
Retrieved
Evidence
false</p>
        <p>Automata</p>
        <p>KB
Dialog Manager</p>
        <p>Retrieval Layer</p>
        <p>FSA are mathematical models of computation typically
taught in computer science degree programs which are
often represented as structured graphs. They are formally
defined as a quintuple consisting of: (1) a finite set of
states , (2) a finite set of input symbols Σ , (3) a transition
function  :  × Σ →  that maps each state and input
symbol to a new state, (4) a start state 0 ∈ , and (5) a
set of accepting (or final) states  ⊆ .</p>
      </sec>
    </sec>
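<p>As a concrete illustration of the quintuple definition, the following minimal Python sketch encodes a two-state automaton over {0, 1}; the specific automaton is our own example, not one from the paper's corpus:</p>

```python
# Minimal FSA following the quintuple definition (Q, Sigma, delta, q0, F).
# The concrete automaton (accepts words with an even number of 1s) is illustrative only.
Q = {"s0", "s1"}                 # (1) finite set of states
Sigma = {"0", "1"}               # (2) finite set of input symbols
delta = {                        # (3) transition function: (state, symbol) -> state
    ("s0", "0"): "s0", ("s0", "1"): "s1",
    ("s1", "0"): "s1", ("s1", "1"): "s0",
}
q0 = "s0"                        # (4) start state
F = {"s0"}                       # (5) set of accepting (final) states

def accepts(word: str) -> bool:
    """Run the automaton on `word` and report whether it ends in an accepting state."""
    state = q0
    for symbol in word:
        if symbol not in Sigma:
            raise ValueError(f"symbol {symbol!r} not in alphabet")
        state = delta[(state, symbol)]
    return state in F
```

<p>Here accepts("1010") evaluates to True, since the word contains an even number of 1s.</p>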
    <sec id="sec-4">
      <title>4. Data Collection and Annotation</title>
      <sec id="sec-4-1">
        <title>Corpus</title>
        <p>To develop the NLU module, we built upon an existing resource, the NoVAGraphS corpus [21]. The corpus consists
of 32 human–computer conversations focused on the
domain of FSA, comprising a total of 706 dialog turns. Since
our work focuses on understanding user input, we
exclusively use the 353 human utterances from the dataset.</p>
        <p>Based on this corpus, we extended the dataset through
data augmentation techniques by using a mix of
commercial and open-weight LLMs, including GPT-4o, GPT-o1,
and GPT-o3.mini, as well as two locally run models,
Llama3.1 and DeepSeek R1, generating paraphrases of
the original utterances.4 To ensure data quality, we
manually reviewed the synthetic utterances to verify their
correctness. In addition, we also included 100 random
off-topic questions extracted from the SQuAD 2.0 dataset
[22, 23], selected to represent out-of-domain input5.</p>
        <p>The final dataset contains 1,080 user utterances. All
utterances, both original and synthetic, were
manually annotated by one of the authors—proficient in
English—for both intent and entity information.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Notes</title>
        <p>4https://openai.com/index/hello-gpt-4o/, https://openai.com/index/openai-o3-mini/, IntroducingOpenAIo1, https://ollama.com/library/deepseek-r1:8b, https://huggingface.co/meta-llama/Llama-3.1-8B</p>
        <p>5https://huggingface.co/datasets/rajpurkar/squad_v2</p>
      </sec>
      <sec id="sec-4-3">
        <title>Intents</title>
        <p>We used a hierarchical labeling annotation to
better capture the specific topic of each user utterance.
The resulting dataset consists of two levels of classes:
main intents and sub-intents. Specifically, we defined 7
main intents representing the general topic of the
question (Table 1). For four of these main intents (AUTOMATON,
TRANSITION, STATE, and GRAMMAR) an additional
annotation level, called sub-intent, was introduced. This
second level includes a total of 32 sub-intents (Table 2),
which specify the question’s more fine-grained topic
depending on the main intent category.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Entities</title>
        <p>An entity is encoded as [init-char,fin-char,type].6 We defined three entity types:</p>
        <p>• INPUT: for text fragments containing inputs or sequences of symbols. For example, in the sentence “Does it only accept 1s and 0s?” there are two entities of type INPUT: [20,21,"input"], [27,28,"input"];</p>
        <p>• NODE: for text fragments containing nodes or states of the automaton. For example, in the sentence “Is there a transition between q2 and q0?” there are two entities of type NODE: [30,32,"node"], [37,39,"node"];</p>
        <p>• LANGUAGE: for text fragments containing information about the language accepted by the automaton. For example, in the sentence “Does the automaton accept strings over the alphabet {0,1}?” there is one entity of type LANGUAGE: [53,58,"language"].</p>
        <p>6https://github.com/doccano/doccano</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Neural NLU</title>
      <p>The first module of our architecture handles NLU through a two-step pipeline: (i) Intent Classification and (ii) Named-Entity Recognition. The goal is to extract a structured representation of the user’s utterance by identifying the intent and the entities in the user input. For example:</p>
      <p>Input: “Is there a state called s9 in the automaton?”
Output: { Intent = state.existence, Entities = [(NODE, ‘s9’)] }</p>
      <p>To build the NLU module, we trained two models for Intent Classification and Named-Entity Recognition using the corpus described in Section 4, and we evaluated them against the AIML system we proposed in [24].</p>
      <sec id="sec-5-1">
        <title>Intent Classification</title>
        <p>For intent classification, we fine-tuned a BERT-base-uncased model7 for both main and sub-intent classification. The dataset was split into 60% training, 20% development, and 20% testing. We fine-tuned with the following hyper-parameters: 20 epochs, LR 2×10−5, linear warm-up 10%, batch 16. Training was logged with Weights &amp; Biases. Our approach significantly outperforms the AIML baseline, achieving a macro-F1 score of 0.92 on main intents and 0.86 on sub-intents. This marks a substantial improvement over AIML, which scores only 0.33 and 0.20, respectively (see Table 3). Figure 2 compares the confusion matrices for both systems, showing that BERT produces far fewer off-topic errors and handles ambiguous utterances more robustly.</p>
        <p>[Figure 2: Confusion matrices for (a) the AIML baseline and (b) the BERT model.]</p>
        <p>Table 3: Performance on main and sub-intent classification and NER for the fine-tuned BERT model and the AIML baseline (↑ higher is better).
Model | Main Intent F1 ↑ | Sub-intent F1 ↑ | NER F1 ↑
BERT (ours) | 0.92 | 0.86 | 0.92
AIML baseline | 0.33 | 0.20 | -</p>
      </sec>
      <sec id="sec-5-2">
        <title>Named Entity Recognition</title>
        <p>NER is handled using a simplified spaCy v3 pipeline that exclusively employs the NER component on top of a blank model,8 fine-tuned on our annotated dataset with the same data split (60/20/20). The pipeline is based on the transformer architecture [25] and identifies domain-specific entities such as states, transitions and input strings. It achieves an F1-score of 0.92 on the test set (see Table 3).</p>
        <p>7https://huggingface.co/google-bert/bert-base-uncased
8https://spacy.io/usage/v3</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Dialog Manager and Retrieval Layer</title>
      <p>The Dialog Manager is responsible for orchestrating the interaction flow by interpreting the NLU output and coordinating the appropriate system response. This involves analyzing the classified intent and any associated entities, and invoking the corresponding function from the Retrieval Layer.</p>
      <p>The Retrieval Layer is activated whenever the recognized intent is relevant to the domain, thus neither START nor OFF TOPIC. Indeed, START typically triggers a welcome message, while OFF TOPIC handles inputs outside the system’s scope. Since these cases do not require access to the automaton’s knowledge, retrieval is skipped.</p>
      <p>For domain-specific intents (e.g., checking the existence of a state), the Dialog Manager uses a rule-based system that maps intent–entity pairs to specific queries. This design ensures transparency and precise control over system behavior. For instance, when the intent is state.existence and the entity is a node identifier like ‘s9’, the Dialog Manager calls the function exists_node(‘s9’). This function queries the underlying automaton representation to determine whether the specified node exists. The automaton is stored in a Knowledge Base (KB) constructed using the NetworkX Python library,9 which allows efficient graph manipulation. The automaton’s structure is serialized in DOT format, a standard for graph description, and visualized using Graphviz.10</p>
      <p>The Retrieval Layer then returns a structured output (e.g. false, if the node is not found), which is passed to the NLG module for the generation of the final response.</p>
      <p>9https://networkx.org/ 10https://graphviz.org/</p>
    </sec>
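<p>The character-offset encoding of entities can be decoded back to text fragments with a few lines of Python. This is a sketch under the assumption of half-open [start, end) slice offsets; the corpus's exact offset convention may differ, and the example annotation below is ours:</p>

```python
def decode_entities(text: str, spans: list[tuple[int, int, str]]) -> list[tuple[str, str]]:
    """Map [init-char, fin-char, type] annotations back to the annotated
    text fragments, assuming half-open [start, end) character offsets."""
    return [(etype, text[start:end]) for start, end, etype in spans]

sentence = "Is there a state called s9 in the automaton?"
# 's9' occupies characters 24-25, i.e. the slice [24:26] (illustrative annotation)
print(decode_entities(sentence, [(24, 26, "node")]))  # [('node', 's9')]
```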
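<p>The dispatch logic described above can be sketched as follows. The function exists_node and the state.existence intent mirror the paper's running example; the concrete automaton, the handler table, and the surrounding function names are our own assumptions:</p>

```python
import networkx as nx

# Automaton KB as a NetworkX directed graph (illustrative two-state automaton).
A = nx.DiGraph()
A.add_edges_from([
    ("q0", "q1", {"symbol": "1"}), ("q1", "q0", {"symbol": "1"}),
    ("q0", "q0", {"symbol": "0"}), ("q1", "q1", {"symbol": "0"}),
])

def exists_node(node: str) -> bool:
    """Query from the paper's example: does the given state exist in the automaton?"""
    return A.has_node(node)

# Rule-based mapping from intent-entity pairs to retrieval functions.
HANDLERS = {"state.existence": lambda ents: exists_node(ents["NODE"])}
# START and OFF TOPIC never touch the KB, so retrieval is skipped for them.
NO_RETRIEVAL = {"start", "off_topic"}

def retrieve(intent: str, entities: dict):
    """Return the retrieved evidence for a domain intent, or None when skipped."""
    if intent in NO_RETRIEVAL:
        return None  # handled by a welcome / out-of-scope message instead
    return HANDLERS[intent](entities)

print(retrieve("state.existence", {"NODE": "s9"}))  # False -> the retrieved evidence
```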
    <sec id="sec-7">
      <title>7. LLM-based NLG</title>
      <p>
        a-Judge), applying the same error taxonomy. The
annotator pool included 8 students from the Department of
For the NLG module, we adopt a prompting strategy Computer Science, 2 with an engineering background,
based on LLMs that uses both the user input and the and 2 from the Departments of History and Biology. The
output of the Dialog Manager to generate contextually average age was 28, with a range from 21 to 68 years.
relevant and accurate responses. This technique is widely Each annotator evaluated a subset of the responses, with
adopted in RAG systems [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], as it enables the model to overlapping assignments to ensure that all 75 generated
ground its answers in retrieved evidence, reducing hallu- answers were reviewed by multiple judges.
cinations and increasing factual accuracy. Our prompt
template drives the model to act as a domain-specific ex- Table 4
pert — in this case, for finite state automata — instructing Average percentage of answers containing at least one
lait to use only the retrieved data without introducing ex- beled error, computed by aggregating the four error categories
traneous information or explicit references to the source. (INCORRECT, NOT-CHECKABLE, MISLEADING, OTHER). Lower
This approach helps maintain concise, focused answers values indicate better performance.
that avoid potential confusion or unverifiable content.
      </p>
      <sec id="sec-7-1">
        <title>Generator</title>
      </sec>
      <sec id="sec-7-2">
        <title>Human error ↓</title>
      </sec>
      <sec id="sec-7-3">
        <title>GPT-4.5 error ↓</title>
      </sec>
      <sec id="sec-7-4">
        <title>GPT-o3-mini GPT-4o</title>
      </sec>
      <sec id="sec-7-5">
        <title>DeepSeek-r1-8B</title>
      </sec>
      <sec id="sec-7-6">
        <title>Gemma2-9B LLaMA3.1-8B</title>
        <p>LLaMA3.1-8B.11</p>
        <p>To assess the quality of the generated answers, we con- • CLARITY: whether the response is
understandducted a human evaluation using the FactGenie platform able and well-structured;
[26]. A group of 12 volunteer annotators labeled each • USEFULNESS: whether the response is helpful
generation according to four error categories defined by and provides relevant information;
the taxonomy in Kasner and Dusek [27]. In particular: • OVERALL APPRECIATION: whether the response
INCORRECT indicates that the text contradicts the data; is perceived as satisfactory or positively received
NOT-CHECKABLE means the information cannot be veri- by the annotator;
ifed; MISLEADING refers to text that is deceptive given • FACTUAL ACCURACY: whether the response is
the context or omits crucial information; and OTHER in- entirely correct and free from factual errors.
cludes problematic cases that do not fit into the other
categories. In addition to human annotation, we also The same group of 12 human annotators performed
performed automatic labeling using GPT-4.512 (LLM-as- labeling according to these dimensions.
Table 5 shows that GPT-o3-mini receives the most
11https://openai.com/index/hello-gpt-4o/, https://openai.com/index/ favorable user judgments across all dimensions. Among
openai-o3-mini/, https://ollama.com/library/deepseek-r1:8b, open-weight models, DeepSeek-r1-8B is the most
poshttps://huggingface.co/google/gemma-2-9b, https://huggingface. itively rated, while LLaMA3.1-8B and Gemma2-9B
re12chott/pmse:/t/ao-plleanmaia./cLolmam/ina-d3e.x1/-i8nBtroducing-gpt-4-5/ ceive consistently lower preferences from annotators.</p>
      </sec>
    </sec>
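<p>The prompting strategy described in this section can be sketched as a simple template function; the wording of the instructions is our illustrative stand-in for the actual prompt template:</p>

```python
def build_prompt(user_input: str, retrieved_evidence) -> str:
    """Combine the user input and the Dialog Manager's retrieved evidence
    into an NLG prompt that casts the LLM as an FSA domain expert and
    restricts it to the retrieved data (illustrative wording)."""
    return (
        "You are an expert on finite state automata.\n"
        "Answer the user's question using ONLY the retrieved data below.\n"
        "Do not add extra information or mention the data source.\n\n"
        f"Retrieved data: {retrieved_evidence}\n"
        f"Question: {user_input}\n"
        "Answer:"
    )

prompt = build_prompt("Is there a state called 's9' in the automaton?", False)
```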
    <sec id="sec-8">
      <title>8. Conclusions</title>
      <p>This work presents a significant advancement over
previous systems aimed at the exploration of graphical
structures, by proposing a hybrid modular architecture that
integrates NLU and NLG techniques based on
Transformers and LLMs. The implemented DS addresses several
key limitations of rule-based DSs, such as rigid pattern
matching, limited context handling, and difficulties in
interacting with external data sources.</p>
      <p>Compared to AIML, our system stands out for its
greater expressive flexibility and its ability to adapt to
complex conversational flows, thanks to a more
articulated dialog management mechanism. The
introduction of a neural classifier for intent recognition, along
with a spaCy-based NER module, has substantially
improved the robustness of natural language understanding,
achieving F1 scores above 90% for both Intent
Classification and NER. Moreover, the RAG component has
significantly reduced hallucinations and ambiguity in
generation, providing contextually accurate responses
that are well-grounded in structured data.</p>
      <p>The results demonstrate that a hybrid and modular
approach can ensure accessibility, reliability, and
control—fundamental features for the adoption of DSs in
educational and assistive contexts. Our framework
therefore represents a concrete step toward more interpretable,
adaptable, and user-centered intelligent DSs. In future
works we plan to evaluate the complete system with blind
people.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Limitations</title>
      <sec id="sec-9-1">
        <title>Evaluation with Target Users</title>
        <p>While the system shows strengths in modularity, accuracy, and integration of LLMs, a significant limitation persists: its accessibility has yet to be validated with learners.</p>
        <p>Although designed with accessibility in mind, the system’s real-world effectiveness and usability—especially for visually impaired individuals interacting with graphical content—remain untested. Conducting a structured evaluation with these target users is crucial to determine its pedagogical impact and practical usability.</p>
        <p>agrams, in: Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 312–317. doi:10.1145/3308561.3353811.</p>
        <p>[10] D. Ahmetovic, C. Bernareggi, J. a. Guerreiro, S. Mascetti, A. Capietto, Audiofunctions.web: Multimodal exploration of mathematical function graphs, in: Proceedings of the 16th International Web for All Conference, W4A ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1–10. doi:10.1145/3315002.3317560.</p>
        <p>[11] J. Su, A. Rosenzweig, A. Goel, E. de Lara, K. N. Truong, Timbremap: enabling the visually-impaired to use maps on touch-enabled devices, in: M. de Sá, L. Carriço, N. Correia (Eds.), Proceedings of the 12th Conference on Human-Computer Interaction with Mobile Devices and Services, Mobile HCI 2010, Lisbon, Portugal, September 7-10, 2010, ACM International Conference Proceeding Series, ACM, 2010, pp. 17–26. doi:10.1145/1851600.1851606.</p>
        <p>[12] V. Sorge, M. Lee, S. Wilkinson, End-to-end solution for accessible chemical diagrams, in: Proceedings of the 12th International Web for All Conference, W4A ’15, Association for Computing Machinery, New York, NY, USA, 2015. doi:10.1145/2745555.2746667.</p>
        <p>[13] S. Chockthanyawat, E. Chuangsuwanich, A. Suchato, P. Punyabukkana, Towards automatic diagram description for the blind, in: i-CREATe. The International Convention on Rehabilitation Engineering and Assistive Technology, 2017, pp. 1–4. doi:10.13140/RG.2.2.11969.04961.</p>
        <p>[14] Z. Zhang, Z. Zhang, H. Chen, Z. Zhang, A joint learning framework with BERT for spoken language understanding, IEEE Access 7 (2019) 168849–168858. doi:10.1109/ACCESS.2019.2954766.</p>
        <p>[15] M. Roman, A. Shahid, S. Khan, A. Koubâa, L. Yu, Citation intent classification using word embedding, IEEE Access 9 (2021) 9982–9995. doi:10.1109/ACCESS.2021.3050547.</p>
        <p>[16] D. Nadeau, S. Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30 (2007) 3–26. doi:10.1075/LI.30.1.03NAD.</p>
        <p>[17] J. Li, A. Sun, J. Han, C. Li, A survey on deep learning for named entity recognition, IEEE Transactions on Knowledge and Data Engineering 34 (2018) 50–70. doi:10.1109/TKDE.2020.2981314.</p>
        <p>[18] P. Liu, Y. Guo, F. Wang, G. Li, Chinese named entity recognition: The state of the art, Neurocomputing 473 (2021) 37–53. doi:10.1016/j.neucom.2021.10.101.</p>
        <p>[19] B.-S. Posedaru, F.-V. Pantelimon, M.-N. Dulgheru, T.-M. Georgescu, Artificial intelligence text processing using retrieval-augmented generation: Applications in business and education fields, Proceedings of the International Conference on Business Excellence 18 (2024) 209–222. doi:10.2478/picbe-2024-0018.</p>
        <p>[20] F. Miladi, V. Psyché, D. Lemire, Leveraging GPT-4 for accuracy in education: A comparative study on retrieval-augmented generation in MOOCs (2024) 427–434. doi:10.1007/978-3-031-64315-6_40.</p>
        <p>[21] E. Di Nuovo, M. Sanguinetti, P. F. Balestrucci, L. Anselma, C. Bernareggi, A. Mazzei, Educational dialogue systems for visually impaired students: Introducing a task-oriented user-agent corpus, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 5507–5519.</p>
        <p>[22] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: J. Su, K. Duh, X. Carreras (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, 2016, pp. 2383–2392. URL: https://aclanthology.org/D16-1264. doi:10.18653/v1/D16-1264. arXiv:1606.05250.</p>
        <p>[23] P. Rajpurkar, R. Jia, P. Liang, Know what you don’t know: Unanswerable questions for SQuAD, in: I. Gurevych, Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 784–789. URL: https://aclanthology.org/P18-2124. doi:10.18653/v1/P18-2124. arXiv:1806.03822.</p>
        <p>[24] P. F. Balestrucci, E. Di Nuovo, M. Sanguinetti, L. Anselma, C. Bernareggi, A. Mazzei, An educational dialogue system for visually impaired people, IEEE Access (2024).</p>
        <p>[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</p>
        <p>[26] Z. Kasner, O. Platek, P. Schmidtova, S. Balloccu, O. Dusek, factgenie: A framework for span-based evaluation of generated texts, in: S. Mahamood, N. L. Minh, D. Ippolito (Eds.), Proceedings of the 17th International Natural Language Generation Conference: System Demonstrations, Association for Computational Linguistics, Tokyo, Japan, 2024, pp. 13–15. URL: https://aclanthology.org/2024.inlg-demos.5/. doi:10.18653/v1/2024.inlg-demos.5.</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: Grammar
and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content
as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Power</surname>
          </string-name>
          ,
          <source>The ALT Text: Accessible Learning with Technology</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Balestrucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Anselma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bernareggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mazzei</surname>
          </string-name>
          ,
          <article-title>Building a spoken dialogue system for supporting blind people in accessing mathematical expressions</article-title>
          , in: F. Boschetti,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Lebani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          , N. Novielli (Eds.),
          <source>Proceedings of the 9th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2023</year>
          ), CEUR Workshop Proceedings, Venice, Italy,
          <year>2023</year>
          , pp.
          <fpage>70</fpage>
          -
          <lpage>77</lpage>
          . URL: https://aclanthology.org/2023.clicit-1.10/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <source>The Elements of AIML Style, ALICE A.I Foundation</source>
          ,
          <year>2001</year>
          . Available at https://files.ifi.uzh.ch/cl/hess/classes/seminare/chatbots/style.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oliverio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Piroi</surname>
          </string-name>
          , D. De Giorgi,
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Balestrucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manolino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mazzei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Anselma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bernareggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Serio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sabena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Armano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coriasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Capietto</surname>
          </string-name>
          , Novagraphs:
          <article-title>Towards an accessible educational-oriented dialogue system</article-title>
          ,
          <source>in: Proceedings of the Second International Workshop on Artificial Intelligence Systems in Education co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence (AIxIA 2024)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Bobrow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Norman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Winograd</surname>
          </string-name>
          ,
          <article-title>Gus, a frame-driven dialog system</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>8</volume>
          (
          <year>1977</year>
          )
          <fpage>155</fpage>
          -
          <lpage>173</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/0004370277900182. doi:10.1016/0004-3702(77)90018-2.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abusitta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey on explainable AI: Techniques, challenges and open issues</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>255</volume>
          (
          <year>2024</year>
          )
          <fpage>124710</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for large language models: A survey</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2312.10997. arXiv:2312.10997.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bernareggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Comaschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dalto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mussio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Parasiliti</given-names>
            <surname>Provenza</surname>
          </string-name>
          ,
          <article-title>Multimodal exploration and manipulation of graph structures</article-title>
          ,
          <source>in: Proceedings of the 11th International Conference on Computers Helping People with Special Needs</source>
          ,
          <source>ICCHP '08</source>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>2008</year>
          , pp.
          <fpage>934</fpage>
          -
          <lpage>937</lpage>
          . doi:10.1007/978-3-540-70540-6_140.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bernareggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ahmetovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mascetti</surname>
          </string-name>
          , muGraph:
          <source>Haptic Exploration and Editing of 3D Chemical Di</source>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kasner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Dusek</surname>
          </string-name>
          ,
          <article-title>Beyond traditional benchmarks: Analyzing behaviors of open LLMs on data-to-text generation</article-title>
          , in:
          <string-name>
            <given-names>L.-W.</given-names>
            <surname>Ku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srikumar</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>12045</fpage>
          -
          <lpage>12072</lpage>
          . URL: https://aclanthology.org/2024.acl-long.651/. doi:10.18653/v1/2024.acl-long.651.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>