Understanding High-complexity Technical and Regulatory Documents with State-of-the-Art Models: A Pilot Study Bernardo Magnini1,**,‡ , Alessandro Dal Pozzo2 and Roberto Zanoli1 1 Fondazione Bruno Kessler, Trento, Italy 2 Rete Ferroviaria Italiana S.p.A, Italy Abstract We explore the potential of state-of-the-art Large Language Models (LLMs) to reason on the content of high-complexity documents written in Italian. We focus on both technical documents (e.g., describing civil engineering works) and regulatory documents (e.g., describing procedures). While civil engineering documents contain crucial information that supports critical decision-making in construction, transportation and infrastructure projects, procedural documents outline essential guidelines and protocols that ensure efficient operations, adherence to safety standards and effective incident management. Although LLMs offer a promising solution for automating the extraction and comprehension of high-complexity documents, potentially transforming our interaction with technical information, LLMs may encounter significant challenges when processing such documents due to their complex structure, specialized terminology and strong reliance on graphical and visual elements. Moreover, LLMs are known to sometimes produce unexpected or incorrect analyses, a phenomenon referred to as hallucination. The goal of the paper is to conduct an assessment of LLM capacities along several dimensions, including the format of the document (i.e., selectable text PDFs versus scanned OCR PDFs), the structure of the documents (e.g., number of pages, date of the document), the graphical elements (e.g., tables, graphs, photos), the interpretation of text portions (e.g., make a summary), and the need of external knowledge (e.g., to interpret a mathematical expressions). To run the assessment, we took advantage of GPT-4omni, a large multi-modal model pre-trained on a variety of different data. Our findings suggest that there is great potential for real-world applications for high-complexity documents, although LLMs may still be susceptible to produce misleading information. Keywords LLMs, GPT-4omni, Information extraction, Technical documents, Procedural documents, Civil engineering 1. Introduction uments are available either in PDF format as scanned documents, or as PDFs processed with Optical Character Technical documents employed in civil engineering con- Recognition (OCR) software, introducing an additional tain information essential for planning, designing and layer of complexity due to potential variations in text constructing structures that need to ensure safety and recognition quality. Finally, civil engineering technical compliance with regulations. As an example, such high- documents are typically long, easily reaching hundreds complexity documents provide technical guidelines for of pages. Figure 1 shows one of the many visual elements managing the development of roads, bridges and other occurring in the technical documents (civil engineering transport networks. Additionally, these documents are projects in Italian) considered in this study. fundamental for public infrastructure projects, ensuring Similarly to technical documents, regulatory docu- they serve the community effectively and safely. These ments play an equally important role across the same documents are highly complex, particularly due to their sectors, as they outline the steps for managing incidents, multi-modal nature, where textual content is mixed with supervising safety procedures and ensuring regulatory several graphical content. The written content can vary compliance. For example, railway procedural documents from simple explanations to very detailed technical in- contain comprehensive instructions on handling inci- structions, often referring to specialized regulations. The dents and supervising safety measures, introducing addi- visual elements typically include tables with numbers, tional complexity through procedural frameworks. Al- math formulas and detailed drawings of engineering stuff, though procedural documents lack the visual complexity as well as photos from natural environments and render- typical of technical projects, such as the presence of fig- ing of a construction once realized. In addition, doc- ures, tables and graphs, they are dense with text, focusing on legal and procedural details. CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, The paper investigates how state-of-the-art genera- Dec 04 — 06, 2024, Pisa, Italy tive models are able to reason on the content of high- * Corresponding author. complexity technical and regulatory documents written † These authors contributed equally. in Italian. As generative models, both LLMs and Large $ magnini@fbk.eu (B. Magnini); a.dalpozzo@rfi.it (A. Dal Pozzo); Multimodal Models (LMMs), are rapidly becoming more zanoli@fbk.eu (R. Zanoli) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License and more powerful, our research questions aim at as- Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings 2. Assessment Framework We defined a series of questions to assess the model’s proficiency in interpreting written text and visual con- tent, including images and graphs. Table 1 lists queries designed to evaluate how well the model understands tex- tual content, assessing its performance across categories like “Bibliographic Information", “Document Structure" and “Text Interpretation". Similarly, Table 2 presents the list of queries aimed at assessing the model’s ability to interpret graphical content, including “Table", “Photo", “Figure", “Mathematical Expression" and “Graph". Additionally, we investigated the potential for the model to experience hallucinations by making “trap" Figure 1: Figure showing drainage outlets used at the junction questions designed to induce incorrect responses. For points between the bituminous membrane and the rainwater example, a question such as “How tall is the pylon of downpipe. the Zambana Vecchia-Fai della Paganella cableway men- tioned in paragraph 12.6?" was posed, even though nei- ther the specified paragraph nor the whole document contains any information about cableways. Other in- sessing their ability to extract and interpret key informa- stances include queries like “What is the highest value tion, this way reducing the need for manual reviews by in the fifth column of Table 12.8.1-1?", despite the spec- human experts. To this end, we have defined a simple ified table having only 4 columns. Trap questions are question-answer evaluation framework tailored to tech- highlighted in bold in the tables. nical and regulatory documents. As an example, we ask Human evaluators subsequently reviewed and ana- the model questions such as Provide a general summary lyzed all responses provided by the model. Each response of the technical specifications in the document and then we generated by the model was evaluated based on the fol- manually check the model answer. We also consider the lowing scoring: potential for LLMs/LMMs to generate content that is not grounded to the document, an issue often referred to as • 2 points for fully accurate responses: the answer model confabulations or hallucinations [1, 2]. To assess meets the prompt’s requirements completely, confabulations we included “trap" questions mentioning such as providing a full list of figures or a compre- non-existing objects in the document. Finally, the as- hensive summary of the document’s key content. sessment considers both selectable text PDFs, which are • 1 point for partially correct responses: the an- extractable and editable, and scanned OCR PDFs, where swer is incomplete, such as a list of figures miss- text is derived from scanning or from OCR. ing some entries or a summary that covers some A state-of-the-art survey on articles published between important points but omits others. 2000 and 2021, focusing on the applications of Text Min- • 0 points for incorrect responses: the answer fails ing in the construction industry was presented in [3]. [4] to meet requirements, such as a mostly incom- and [5] explored NLP application and development in con- plete or missing list of figures or a summary that struction. Various machine learning and deep learning- does not accurately match the document’s con- based NLP techniques, and their applications in construc- tent. tion research, are documented in [6]. There are several potential real-world applications of 2.1. Model LLMs in supporting and enhancing various sectors. Con- struction firms can exploit LLMs to assist in reviewing For our experiments we use GPT-4omni[7], available technical documents for safety regulations and building from OpenAI since April 2024, which represents a signifi- codes, helping simplifying compliance checks. Addition- cant advance in AI innovation by becoming the first truly ally, organizations with large document archives can multimodal model capable of interpreting and generating leverage LLMs to identify potential inconsistencies or various types of data, including text, images and audio. conflicts in procedures, providing valuable insights for further human review and ensuring adherence to unified 2.2. Dataset operational protocols. The dataset for our pilot experiments includes four high- complexity documents, two are technical specifications and two are regulatory documents. More specifically: Table 1 Questions (in Italian) used to test the model’s capacity to reason on textual content. “Trap" questions are highlighted in bold. Content Question 1. Bibliographic Estrai il nome completo degli autori del documento. Estrai il titolo completo del documento. Estrai la Information data di pubblicazione del documento. 2. Document Riporta l’esatto numero di pagine del documento. Riporta l’indice delle tabelle presenti nel documento. Structure Riporta l’indice delle figure presenti nel documento. 3. Text Interpre- Documento: Fai un riassunto generale del capitolato tecnico. Quali normative e regolamenti devono tation essere rispettati secondo il capitolato tecnico? Qual è la timeline del progetto come delineata nel capitolato tecnico? Qual e’ la lunghezza della fune portante della funivia descritta nel capitolato tecnico? Paragrafo: Riassumi il paragrafo II.12 PROCESSO DI CONDIVISIONE DELLE INDAGINI del documento seguente utilizzando un linguaggio tecnico. Includi tutte le informazioni pertinenti e fornisci un livello di dettaglio approfondito. Indica chiaramente eventuali riferimenti a documenti e procedure pertinenti. Come sono suddivise le attività di manutenzione ordinaria? Table 2 Questions (in Italian) used to test the model’s capacity to reason on pictures, graphs and tables. “Trap" questions are in bold. Content Question 4. Table Qual è il valore richiesto della resistenza a rottura per trazione su un provino longitudinale per la mem- brana inferiore da 4 mm? Cosa rappresenta la Tabella 12.8.1-2? Quali caratteristiche della membrana sono riportate nella Tabella 12.8.1-1 rispetto alla Tabella 12.8.1-2? Quale è il valore più alto nella quinta colonna della Tabella 12.8.1-1? Per quante tipologie di eventi di cui alla tabella allegato 9 è previsto l’invio dell’Avviso di Accadimento (AA)? 5. Photo Descrivi gli oggetti o le persone presenti nella figura 12.8.4.2.6.a? Il tubo verde nella figura passa sopra oppure sotto alla rotaia? Quanti alberi ci sono nella figura? 6. Figure Descrivi il contenuto della figura 12.8.4.2.5.c. Nella figura 12.8.4.2.5.c dove va posizionato il bocchettone in HDPN? Cosa rappresenta l’oggetto di colore rosso presente nella figura? 7. Mathematical Descrivi a cosa fa riferimento l’espressione matematica 11 ≤ 𝑛 ≤ 40 riportata nella tabella Tabella Expression 12.14.3.7. Cosa significa il simbolo ≤ nell’espressione matematica? Come si interpreta il prodotto che è presente nell’espressione matematica? 8. Graph Cosa è rappresentato nel grafico di figura 1? Cosa rappresenta l’asse delle X e l’asse delle Y del grafico? Quale unità di misura è utilizzata per esprimere i valori sull’asse delle Y? A quale valore della curva del grafico corrisponde il valore 100 delle X? • A 96-page technical specification document As far as the content of the four documents, the first for civil engineering works from the Italian page provides general information (bibliographic) about railways[8]. the document, including publication date and authors. • A 32-page document on the design of an outdoor An example is reported in Figure 2. swimming pool in Trentino-Alto Adige[9]. • A 49-page regulatory document from RFI out- linimg procedures for investigating railway inci- dents. • A 12-page regulatory document from RFI focus- ing on managing prescriptions and supervising activities by ANSFISA (Agenzia Nazionale per la Sicurezza Ferroviaria). Figure 2: Each document’s first page contains bibliographic information. The two technical documents are licensed for unre- stricted use in non-commercial, educational, or research contexts. In contrast, the two procedural documents re- Furthermore, the documents contain a combination of lated to the Italian railway system are intended only for photos, figures and tables, exemplified by Figures 1, 3, 4, internal RFI use and cannot be distributed. respectively. These visual elements are important for explaining technical details and the logical structure of Table 3 procedures, often substituting written descriptions. This Statistics on the documents used for assessment. means that the model frequently needs to interpret these Tech. Docs Reg. Docs visual elements without relying on explanations provided in the text. Content Railway Pool Railway Railway Pages 96 32 49 12 Tables 20 4 14 0 Photo 2 2 0 0 Figure 31 19 2 0 Graph 2 0 0 0 Figure 3: Photo showing a worker applying the waterproof membrane. Figure 5: Formula representing the number of constraint mechanisms (restraints) required to be tested according to the specifications outlined in the chapter. Figure 4: Excerpt of the table reporting the characteristics of the 4mm lower membrane. An important feature of our dataset is that it includes both selectable PDF and scanned OCR PDF. More specif- ically, the three RFI documents are selectable text PDF, where the text is digital, searchable and can be copied, typically created by word processors or digital publishing software. These documents contain pages with tables and figures, with some tables spanning multiple pages and Figure 6: Graphic representing melting of the stiffness of others presented as images. Certain figures and tables elastic devices of bearing devices. include captions, while others do not. The documents also includes formulas and graphics, such as those in Figures 5 and 6. On the other hand, the swimming pool which are internal to RFI, it was not necessary. For the document is a scanned OCR PDF, which is not directly contamination test, we masked document elements, such selectable and searchable. Some pages in this document as numbers and paragraph identifiers in the text, and are misaligned compared to the standard orientation, and asked the model to fill in these gaps. For instance, we it also includes tables and figures across the document. prompted the model with tasks like “Replace the MASK Table 3 shows a comparison of the key characteristics marker with the missing paragraph number in the fol- of these documents. lowing text". Results indicate that the model was unable to identify the missing words, suggesting that it is likely 2.3. Contamination Test to have not encountered these documents in the pre- training phase. Moreover, even if prior exposure to the We ran a contamination test to verify that GPT-4omni did documents could improve GPT’s performance, its unfa- not use in its pre-training the documents of our dataset. miliarity with the specific questions and answers should The test was carried out on two publicly available tech- limit its accuracy in responding. nical documents, while for the regulatory documents, 2.4. Experimental Setup Table 4 Results (accuracy) on regular questions. The overall accuracy There are two modalities to query GPT-4omni: using the on the dataset is 85.83%. OpenAI playground or the OpenAI API. We used the API because it allows for quickly scaling from analyzing a few Tech. Docs Reg. Docs documents to tens or thousands automatically, whereas Content Railway Pool Railway Avg. with the playground documents must be uploaded manu- Biblio. Info. 1.00 1.00 1.00 1.00 ally one at a time. We used OpenAI API version 1.34.0 in Doc. Struct. 0.50 0.67 0.92 0.75 conjunction with GPT-4omni version gpt-4o-2024-05-13. Text Interp. 0.80 1.00 0.62 0.76 Since GPT-4omni is not deterministic, even with tem- Table 1.00 1.00 0.80 0.90 perature set to 0, we kept all default parameters of the Photo 0.50 1.00 - 0.75 model. Figure 0.50 1.00 - 0.75 The PDF documents were first converted, using the Math Exp. 1.00 1.00 - 1.00 free online tool PDF24, into images, as PDF format in- Graph 1.00 - - 1.00 puts are not currently supported GPT-4omni API. This contrasts with the playground, where PDF uploads are allowed. Each document’s page was transformed into an Table 5 image, using the PNG format and setting the resolution to Results (accuracy) on “trap" questions. The overall accuracy on the dataset is 80.25%. 300 DPI to ensure high-quality reproduction of the origi- nal document pages. For each document, the images were Tech. Docs Reg. Docs then uploaded by the OpenAI API in the exact sequence Content Railway Pool Railway Avg. of their respective pages. Regarding the prompt used for querying the model, we used the following: Rispondi alla Biblio. Info. - - - - Doc. Struct. - - 1.00 1.00 seguente domanda basandoti sul capitolato tecnico fornito, Text Interp. 0.50 1.00 0.71 0.71 senza usare alcuna conoscenza preliminare. We tested GPT-4omni’s non-deterministic behavior by Table 0.00 1.00 1.00 0.75 making five requests per question set, using the shorter Photo 1.00 1.00 - 1.00 swimming pool document (32 pages), to avoid potential Figure 0.00 1.00 - 0.50 Math Exp. 0.00 1.00 - 0.50 server time-outs. For each set of questions, GPT-4omni Graph 1.00 - - 1.00 we assessed how consistent the answers are with each other on a scale from 0 (inconsistent) to 1 (consistent). The average consistency score across 8 question sets was 0.85. 3.1. Discussion As of writing time (June 2024), the cost of process- Results allow us to draw the following conclusions re- ing one prompt for one document in our dataset using garding GPT-4omni’s ability to understand textual and the OpenAI API is approximately $0.50. Processing time visual content for each question category. also needs to be considered. For instance, querying GPT- 4omni for the longer document (96 pages) takes an aver- Bibliographic Information. A perfect score for both age of 3 minutes and 20 seconds. technical and regulatory documents indicates that the model consistently retrieved bibliographic information 3. Results and Discussion (author, title, date) accurately. GPT-4omni achieves an average accuracy of 83,66% on Document Structure. GPT-4omni is not perfect at de- textual content and 88,00% on visual content, resulting in tecting the structure of the documents. For example, the an overall accuracy of 85.83%. However, accuracy drops model sometimes includes invented entries or omits the significantly, to 80,25%, when presented with questions entire index of the technical railway documents. This specifically designed to induce errors (“trap" questions). could be attributed to the document’s complexity, con- GPT-4omni’ scores for both textual content and graphical taining lengthy table labels (e.g., Table 12.8.2.1-1), a large elements, ranging from 0 (indicating no accuracy) to number of figures and tables (51), the absence of captions 1 (indicating perfect accuracy) are provided separately for some of them, and a high page count (96). We observe for regular questions (Table 4) and for “trap" questions that the model is highly sensitive to the prompts used. (Table 5). For instance, when prompted with: Report the number of tables present in the document for a regulatory document, the model inaccurately Quale è il valore più alto nella quinta returns a result of just one table. In contrast, when we colonna della Tabella 12.8.1-1? refined the prompt as: the model produced: Identify all the tables present in the following document. For each table Nella quinta colonna della Tabella 12.8.1-1, found, provide the page number where che rappresenta le tolleranze, il valore più it is located and the total number of alto è ± 20% per la resistenza a rottura per tables in the document trazione su provino longitudinale e trasver- sale, e per la stabilità di forma a caldo the model accurately lists the tables along with their corresponding pages and correctly identifies six tables. despite the absence of a fifth column. The model’s As for the pool document, the model did not extract the answer was so detailed that, without verifying the docu- exact number of pages, likely due to the absence of page ment, even a human might find it difficult to recognize numbers. that the response had no basis in the actual content. Text Interpretation. The model performs better in Photos. GPT-4omni can describe objects in photos of the pool document than on the railway documents in documents. However, in the railway technical document, text interpretation. In particular, GPT-4omni makes a the model missed crucial details, like a green tube and mistake in a paragraph-level “trap" question. When asked railway track in Figure 3, which affected its ability to about the height of the cable car pylon mentioned in answer questions about their relative position. Interest- paragraph 12.6, the model incorrectly claims it was 43 ingly, when provided only with the relevant page, the meters tall, despite neither the paragraph nor the entire model correctly identified all objects, including key ele- document containing any references to cable cars. As ments, suggesting that the document context or photo in the previous case, we found that the model is highly placement may influence its comprehension of images. sensitive to prompt phrasing. For example, when asked to: Figures. In the railway document, GPT-4omni fell into a “trap" question and additionally failed to accurately Riassumi il contenuto del paragrafo describe the content of Figure 1. Regarding the “trap" II.12 PROCESSO DI CONDIVISIONE question: DELLE INDAGINI Cosa rappresenta l’oggetto di colore the model provides a somewhat brief and general re- rosso presente nella figura? sponse. However, when the prompt was made more the model identified a red object that was not present specific, such as: in the figure and even provided a detailed description of it: Riassumi il paragrafo II.12 ’PRO- CESSO DI CONDIVISIONE DELLE L’oggetto di colore rosso presente nella INDAGINI’ del documento seguente figura rappresenta il "rivestimento imper- utilizzando un linguaggio tecnico. meabile" dello strato di colore rosso, come Includi tutte le informazioni perti- indicato nella legenda e nel contesto delle nenti e fornisci un livello di dettaglio spiegazioni tecniche fornite nella sezione approfondito. Indica chiaramente pertinente del capitolato tecnico, in parti- eventuali riferimenti a documenti e colare nelle fasi di impermeabilizzazione procedure pertinenti straordinaria e ripresa del rivestimento im- permeabile the model produces a much more accurate and detailed summary. As for the image description: Tables. As for interpreting table content, GPT-4omni Descrivi il contenuto della figura performs well in both document types. However, in the 12.8.4.2.5.c. Nella figura 12.8.4.2.5.c railway document, the model falls into the “trap" question dove va posizionato il bocchettone in by attempting to answer the query about the value in the HDPN? fifth column of a table with only four columns. When prompted with: GPT-4omni initially states that it cannot describe the image as it does not exist: La figura 12.8.4.2.5.c non è visibile nel doc- 4. Conclusion umento condiviso, quindi non posso fornire una descrizione dei suoi contenuti specifici We showed that GPT-4omni has a high potential for ana- lyzing technical and regulatory documents. However, the However, in the subsequent question about the place- model tends to make factual errors, to generate inaccu- ment of the nozzle, the model correctly described the rate details and to provide misleading answers supported nozzle placement: by technical explanations. These observations highlight potential limitations when handling long and complex Nella figura 12.8.4.2.5.c, il bocchettone in documents, and further research is needed to better un- HDPM va posizionato in corrispondenza dei derstand and address these challenges. Our study has fori di scarico, come indicato nella figura some limitations that should be considered. stessa Limited Sample Size. The evaluation was based on a dataset of four documents, which may not be representa- Math Expressions. GPT-4omni demonstrates good tive of the broader range of technical documents. performance in interpreting mathematical expressions in Query Format. We employed a multi-question prompt technical documents. However, in the railway document, format, grouping multiple questions within a single the model made a mistake on the “trap" question asking prompt. We plan to explore an approach where each about multiplication: question is presented as an individual prompt. Examining Positional Bias. There is a possibility that Come si interpreta il prodotto che è pre- the answer location within the document (beginning, sente nell’espressione matematica? middle, or end) might affect the model’s performance. Contextual Sensitivity Analysis. The amount of context in a formula that did not have any multiplication: provided could influence GPT in answering questions related to specific document elements. We plan to sys- Il prodotto presente nell’espressione tematically compare the model accuracy when presented matematica 11