Understanding High-complexity Technical and Regulatory Documents with State-of-the-Art Models: A Pilot Study
Bernardo Magnini1,**,‡, Alessandro Dal Pozzo2 and Roberto Zanoli1
1 Fondazione Bruno Kessler, Trento, Italy
2 Rete Ferroviaria Italiana S.p.A, Italy


                                                   Abstract
                                                   We explore the potential of state-of-the-art Large Language Models (LLMs) to reason on the content of high-complexity
                                                   documents written in Italian. We focus on both technical documents (e.g., describing civil engineering works) and regulatory
                                                   documents (e.g., describing procedures). While civil engineering documents contain crucial information that supports
                                                   critical decision-making in construction, transportation and infrastructure projects, procedural documents outline essential
                                                   guidelines and protocols that ensure efficient operations, adherence to safety standards and effective incident management.
                                                   Although LLMs offer a promising solution for automating the extraction and comprehension of high-complexity documents,
                                                   potentially transforming our interaction with technical information, LLMs may encounter significant challenges when
                                                   processing such documents due to their complex structure, specialized terminology and strong reliance on graphical and
                                                   visual elements. Moreover, LLMs are known to sometimes produce unexpected or incorrect analyses, a phenomenon referred
                                                   to as hallucination. The goal of the paper is to conduct an assessment of LLM capacities along several dimensions, including
                                                   the format of the document (i.e., selectable text PDFs versus scanned OCR PDFs), the structure of the documents (e.g., number
                                                   of pages, date of the document), the graphical elements (e.g., tables, graphs, photos), the interpretation of text portions (e.g.,
                                                   make a summary), and the need for external knowledge (e.g., to interpret a mathematical expression). To run the assessment,
                                                   we took advantage of GPT-4omni, a large multi-modal model pre-trained on a variety of different data. Our findings suggest
                                                   that there is great potential for real-world applications for high-complexity documents, although LLMs may still be prone
                                                   to producing misleading information.

                                                   Keywords
                                                   LLMs, GPT-4omni, Information extraction, Technical documents, Procedural documents, Civil engineering


CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
* Corresponding author.
† These authors contributed equally.
magnini@fbk.eu (B. Magnini); a.dalpozzo@rfi.it (A. Dal Pozzo); zanoli@fbk.eu (R. Zanoli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).

1. Introduction

Technical documents employed in civil engineering contain information essential for planning,
designing and constructing structures that need to ensure safety and compliance with
regulations. As an example, such high-complexity documents provide technical guidelines for
managing the development of roads, bridges and other transport networks. Additionally, these
documents are fundamental for public infrastructure projects, ensuring they serve the
community effectively and safely. These documents are highly complex, particularly due to
their multi-modal nature, where textual content is mixed with a variety of graphical content.
The written content can vary from simple explanations to very detailed technical instructions,
often referring to specialized regulations. The visual elements typically include tables with
numbers, math formulas and detailed drawings of engineering components, as well as photos of
natural environments and renderings of the completed construction. In addition, documents are
available either in PDF format as scanned documents, or as PDFs processed with Optical
Character Recognition (OCR) software, introducing an additional layer of complexity due to
potential variations in text recognition quality. Finally, civil engineering technical
documents are typically long, easily reaching hundreds of pages. Figure 1 shows one of the many
visual elements occurring in the technical documents (civil engineering projects in Italian)
considered in this study.

Figure 1: Drainage outlets used at the junction points between the bituminous membrane and the
rainwater downpipe.

   Similarly to technical documents, regulatory documents play an equally important role across
the same sectors, as they outline the steps for managing incidents, supervising safety
procedures and ensuring regulatory compliance. For example, railway procedural documents
contain comprehensive instructions on handling incidents and supervising safety measures,
introducing additional complexity through procedural frameworks. Although procedural documents
lack the visual complexity typical of technical projects, such as the presence of figures,
tables and graphs, they are dense with text, focusing on legal and procedural details.
   The paper investigates how state-of-the-art generative models are able to reason on the
content of high-complexity technical and regulatory documents written in Italian. As generative
models, both LLMs and Large Multimodal Models (LMMs), are rapidly becoming more and more
powerful, our research questions aim at assessing their ability to extract and interpret key
information, thereby reducing the need for manual reviews by human experts. To this end, we
have defined a simple question-answer evaluation framework tailored to technical and regulatory
documents. As an example, we ask the model questions such as “Provide a general summary of the
technical specifications in the document” and then we manually check the model's answer. We
also consider the potential for LLMs/LMMs to generate content that is not grounded in the
document, an issue often referred to as model confabulations or hallucinations [1, 2]. To
assess confabulations we included “trap” questions mentioning objects that do not exist in the
document. Finally, the assessment considers both selectable text PDFs, which are extractable
and editable, and scanned OCR PDFs, where text is derived from scanning or from OCR.
   A state-of-the-art survey of articles published between 2000 and 2021, focusing on the
applications of Text Mining in the construction industry, was presented in [3]. [4] and [5]
explored NLP application and development in construction. Various machine learning and deep
learning-based NLP techniques, and their applications in construction research, are documented
in [6].
   There are several potential real-world applications of LLMs in supporting and enhancing
various sectors. Construction firms can exploit LLMs to assist in reviewing technical documents
for safety regulations and building codes, helping to simplify compliance checks. Additionally,
organizations with large document archives can leverage LLMs to identify potential
inconsistencies or conflicts in procedures, providing valuable insights for further human
review and ensuring adherence to unified operational protocols.

2. Assessment Framework

We defined a series of questions to assess the model’s proficiency in interpreting written text
and visual content, including images and graphs. Table 1 lists queries designed to evaluate how
well the model understands textual content, assessing its performance across categories like
“Bibliographic Information”, “Document Structure” and “Text Interpretation”. Similarly, Table 2
presents the list of queries aimed at assessing the model’s ability to interpret graphical
content, including “Table”, “Photo”, “Figure”, “Mathematical Expression” and “Graph”.
   Additionally, we investigated the potential for the model to experience hallucinations by
posing “trap” questions designed to induce incorrect responses. For example, a question such as
“How tall is the pylon of the Zambana Vecchia-Fai della Paganella cableway mentioned in
paragraph 12.6?” was posed, even though neither the specified paragraph nor the whole document
contains any information about cableways. Other instances include queries like “What is the
highest value in the fifth column of Table 12.8.1-1?”, despite the specified table having only
4 columns. Trap questions are highlighted in bold in the tables.
   Human evaluators subsequently reviewed and analyzed all responses provided by the model.
Each response generated by the model was evaluated based on the following scoring:

     • 2 points for fully accurate responses: the answer meets the prompt’s requirements
       completely, such as providing a full list of figures or a comprehensive summary of the
       document’s key content.
     • 1 point for partially correct responses: the answer is incomplete, such as a list of
       figures missing some entries or a summary that covers some important points but omits
       others.
     • 0 points for incorrect responses: the answer fails to meet requirements, such as a
       mostly incomplete or missing list of figures or a summary that does not accurately match
       the document’s content.

2.1. Model

For our experiments we use GPT-4omni [7], available from OpenAI since May 2024, a multimodal
model capable of interpreting and generating various types of data, including text, images and
audio.

2.2. Dataset

The dataset for our pilot experiments includes four high-complexity documents: two technical
specifications and two regulatory documents.
Table 1
Questions (in Italian) used to test the model’s capacity to reason on textual content. “Trap" questions are highlighted in bold.

 Content              Question
 1. Bibliographic     Estrai il nome completo degli autori del documento. Estrai il titolo completo del documento. Estrai la
 Information          data di pubblicazione del documento.
 2. Document          Riporta l’esatto numero di pagine del documento. Riporta l’indice delle tabelle presenti nel documento.
 Structure            Riporta l’indice delle figure presenti nel documento.
 3. Text              Documento: Fai un riassunto generale del capitolato tecnico. Quali normative e regolamenti devono
 Interpretation       essere rispettati secondo il capitolato tecnico? Qual è la timeline del progetto come delineata nel
                      capitolato tecnico? Qual è la lunghezza della fune portante della funivia descritta nel capitolato
                      tecnico?
                      Paragrafo: Riassumi il paragrafo II.12 PROCESSO DI CONDIVISIONE DELLE INDAGINI del documento
                      seguente utilizzando un linguaggio tecnico. Includi tutte le informazioni pertinenti e fornisci un livello
                      di dettaglio approfondito. Indica chiaramente eventuali riferimenti a documenti e procedure pertinenti.
                      Come sono suddivise le attività di manutenzione ordinaria?


Table 2
Questions (in Italian) used to test the model’s capacity to reason on pictures, graphs and tables. “Trap" questions are in bold.

 Content              Question
 4. Table             Qual è il valore richiesto della resistenza a rottura per trazione su un provino longitudinale per la mem-
                      brana inferiore da 4 mm? Cosa rappresenta la Tabella 12.8.1-2? Quali caratteristiche della membrana
                      sono riportate nella Tabella 12.8.1-1 rispetto alla Tabella 12.8.1-2? Quale è il valore più alto nella
                      quinta colonna della Tabella 12.8.1-1?
                      Per quante tipologie di eventi di cui alla tabella allegato 9 è previsto l’invio dell’Avviso di Accadimento
                      (AA)?
 5. Photo             Descrivi gli oggetti o le persone presenti nella figura 12.8.4.2.6.a? Il tubo verde nella figura passa sopra
                      oppure sotto alla rotaia? Quanti alberi ci sono nella figura?
 6. Figure            Descrivi il contenuto della figura 12.8.4.2.5.c. Nella figura 12.8.4.2.5.c dove va posizionato il bocchettone
                      in HDPN? Cosa rappresenta l’oggetto di colore rosso presente nella figura?
 7. Mathematical      Descrivi a cosa fa riferimento l’espressione matematica 11 ≤ 𝑛 ≤ 40 riportata nella tabella Tabella
 Expression           12.14.3.7. Cosa significa il simbolo ≤ nell’espressione matematica? Come si interpreta il prodotto
                      che è presente nell’espressione matematica?
 8. Graph             Cosa è rappresentato nel grafico di figura 1? Cosa rappresenta l’asse delle X e l’asse delle Y del grafico?
                      Quale unità di misura è utilizzata per esprimere i valori sull’asse delle Y? A quale valore della curva
                      del grafico corrisponde il valore 100 delle X?
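
As a minimal sketch of how the question categories of Tables 1 and 2 and the 0/1/2 scoring rubric could be turned into per-category accuracy values, the Python snippet below stores one human-assigned score per answer and averages the scores per category. The normalisation (mean score divided by the maximum of 2) is an assumption, since the paper does not spell out the formula; the names `ScoredAnswer` and `accuracy_by_category` and all scores in the example are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

MAX_SCORE = 2  # rubric: 2 = fully accurate, 1 = partially correct, 0 = incorrect


@dataclass(frozen=True)
class ScoredAnswer:
    category: str  # question category, e.g. "Table" or "Photo"
    is_trap: bool  # True for "trap" questions
    score: int     # human-assigned rubric score: 0, 1 or 2


def accuracy_by_category(answers, trap):
    """Mean rubric score per category, scaled to [0, 1].

    Assumption: accuracy = sum of scores / (MAX_SCORE * number of answers).
    Only answers whose is_trap flag equals `trap` are counted.
    """
    scores = defaultdict(list)
    for answer in answers:
        if answer.score not in (0, 1, 2):
            raise ValueError(f"invalid rubric score: {answer.score}")
        if answer.is_trap == trap:
            scores[answer.category].append(answer.score)
    return {cat: sum(s) / (MAX_SCORE * len(s)) for cat, s in scores.items()}


# Hypothetical scores, for illustration only (not the paper's data).
answers = [
    ScoredAnswer("Table", False, 2),
    ScoredAnswer("Table", False, 2),
    ScoredAnswer("Table", False, 1),
    ScoredAnswer("Table", True, 0),
    ScoredAnswer("Photo", False, 1),
    ScoredAnswer("Photo", True, 2),
]
regular = accuracy_by_category(answers, trap=False)
traps = accuracy_by_category(answers, trap=True)
print(regular, traps)
```

Keeping regular and “trap” questions in separate aggregates mirrors the way the two kinds of questions are reported separately in the results.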


The four documents are the following:

     • A 96-page technical specification document for civil engineering works from the Italian
       railways [8].
     • A 32-page document on the design of an outdoor swimming pool in Trentino-Alto Adige [9].
     • A 49-page regulatory document from RFI outlining procedures for investigating railway
       incidents.
     • A 12-page regulatory document from RFI focusing on managing prescriptions and
       supervising activities by ANSFISA (Agenzia Nazionale per la Sicurezza delle Ferrovie e
       delle Infrastrutture Stradali e Autostradali).

   The two technical documents are licensed for unrestricted use in non-commercial,
educational, or research contexts. In contrast, the two procedural documents related to the
Italian railway system are intended only for internal RFI use and cannot be distributed.
   As for the content of the four documents, the first page provides general (bibliographic)
information about the document, including publication date and authors. An example is reported
in Figure 2.

Figure 2: Each document’s first page contains bibliographic information.

   Furthermore, the documents contain a combination of figures, photos and tables, exemplified
by Figures 1, 3 and 4, respectively. These visual elements are important for explaining
technical details and the logical structure of procedures, often substituting written
descriptions. This means that the model frequently needs to interpret these visual elements
without relying on explanations provided in the text.

Figure 3: Photo showing a worker applying the waterproof membrane.

Figure 4: Excerpt of the table reporting the characteristics of the 4mm lower membrane.

   An important feature of our dataset is that it includes both selectable text PDFs and
scanned OCR PDFs. More specifically, the three RFI documents are selectable text PDFs, where
the text is digital, searchable and can be copied, typically created by word processors or
digital publishing software. These documents contain pages with tables and figures, with some
tables spanning multiple pages and others presented as images. Certain figures and tables
include captions, while others do not. The documents also include formulas and graphs, such as
those in Figures 5 and 6. On the other hand, the swimming pool document is a scanned OCR PDF,
which is not directly selectable and searchable. Some pages in this document are misaligned
compared to the standard orientation, and it also includes tables and figures across the
document.

Figure 5: Formula representing the number of constraint mechanisms (restraints) required to be
tested according to the specifications outlined in the chapter.

Figure 6: Graphic representing melting of the stiffness of elastic devices of bearing devices.

   Table 3 shows a comparison of the key characteristics of these documents.

Table 3
Statistics on the documents used for assessment.

                  Tech. Docs           Reg. Docs
 Content       Railway     Pool     Railway    Railway
 Pages            96        32        49         12
 Tables           20         4        14          0
 Photo             2         2         0          0
 Figure           31        19         2          0
 Graph             2         0         0          0

2.3. Contamination Test

We ran a contamination test to check whether the documents of our dataset were part of
GPT-4omni's pre-training data. The test was carried out on the two publicly available technical
documents, while for the regulatory documents, which are internal to RFI, it was not necessary.
For the contamination test, we masked document elements, such as numbers and paragraph
identifiers in the text, and asked the model to fill in these gaps. For instance, we prompted
the model with tasks like “Replace the MASK marker with the missing paragraph number in the
following text”. Results indicate that the model was unable to identify the missing items,
suggesting that it is unlikely to have encountered these documents in the pre-training phase.
Moreover, even if prior exposure to the documents could improve GPT's performance, its
unfamiliarity with the specific questions and answers should limit the benefit of such
exposure.
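
The masking step of the contamination test can be illustrated with a short Python sketch. It is not the authors' implementation: the regular expression for dotted paragraph identifiers, the helper names (`mask_paragraph_numbers`, `build_prompt`, `recovery_rate`) and the sample sentence are all assumptions made for illustration; only the wording of the instruction comes from the text above. The sketch masks the identifiers, builds the fill-in prompt, and measures how many masked items a set of model guesses reproduces exactly.

```python
import re

MASK = "MASK"
# Assumed pattern: dotted numeric identifiers such as 12.8.1 or 12.8.4.2.
PARAGRAPH_ID = re.compile(r"\b\d+(?:\.\d+)+\b")


def mask_paragraph_numbers(text):
    """Return the text with paragraph identifiers masked, plus the hidden values."""
    hidden = PARAGRAPH_ID.findall(text)
    return PARAGRAPH_ID.sub(MASK, text), hidden


def build_prompt(masked_text):
    """Prepend the fill-in instruction quoted in the paper to the masked text."""
    return ("Replace the MASK marker with the missing paragraph number "
            "in the following text:\n" + masked_text)


def recovery_rate(hidden, guesses):
    """Fraction of masked items that the guesses reproduce exactly, in order."""
    if not hidden:
        return 0.0
    hits = sum(1 for truth, guess in zip(hidden, guesses) if truth == guess.strip())
    return hits / len(hidden)


# Hypothetical sentence, for illustration only.
sample = ("As specified in paragraph 12.8.1, the membrane is laid "
          "as described in paragraph 12.8.4.2.")
masked, hidden = mask_paragraph_numbers(sample)
prompt = build_prompt(masked)
# Hypothetical model guesses: the second one matches the hidden value.
rate = recovery_rate(hidden, ["3.1", "12.8.4.2"])
print(masked, hidden, rate)
```

A recovery rate close to zero on documents that are publicly available would be consistent with the model not having memorised them, although it cannot prove the absence of contamination.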
2.4. Experimental Setup

There are two modalities to query GPT-4omni: using the OpenAI playground or the OpenAI API. We
used the API because it allows for quickly scaling from analyzing a few documents to tens or
thousands automatically, whereas with the playground documents must be uploaded manually one at
a time. We used OpenAI API version 1.34.0 in conjunction with GPT-4omni version
gpt-4o-2024-05-13. Since GPT-4omni is not deterministic, even with temperature set to 0, we
kept all default parameters of the model.
   The PDF documents were first converted into images using the free online tool PDF24, as PDF
inputs are not currently supported by the GPT-4omni API. This contrasts with the playground,
where PDF uploads are allowed. Each page of a document was transformed into a PNG image at a
resolution of 300 DPI to ensure high-quality reproduction of the original document pages. For
each document, the images were then uploaded through the OpenAI API in the exact sequence of
their respective pages. Regarding the prompt used for querying the model, we used the
following: Rispondi alla seguente domanda basandoti sul capitolato tecnico fornito, senza usare
alcuna conoscenza preliminare. [English: Answer the following question based on the provided
technical specification, without using any prior knowledge.]
   We tested GPT-4omni’s non-deterministic behavior by making five requests per question set,
using the shorter swimming pool document (32 pages) to avoid potential server time-outs. For
each set of questions, we assessed how consistent GPT-4omni's answers are with each other on a
scale from 0 (inconsistent) to 1 (consistent). The average consistency score across 8 question
sets was 0.85.
   As of writing time (June 2024), the cost of processing one prompt for one document in our
dataset using the OpenAI API is approximately $0.50. Processing time also needs to be
considered. For instance, querying GPT-4omni for the longest document (96 pages) takes an
average of 3 minutes and 20 seconds.

3. Results and Discussion

GPT-4omni achieves an average accuracy of 83.66% on textual content and 88.00% on visual
content, resulting in an overall accuracy of 85.83%. However, accuracy drops to 80.25% when the
model is presented with questions specifically designed to induce errors (“trap” questions).
GPT-4omni's scores for both textual content and graphical elements, ranging from 0 (indicating
no accuracy) to 1 (indicating perfect accuracy), are provided separately for regular questions
(Table 4) and for “trap” questions (Table 5).

Table 4
Results (accuracy) on regular questions. The overall accuracy on the dataset is 85.83%.

                     Tech. Docs        Reg. Docs
 Content          Railway     Pool      Railway      Avg.
 Biblio. Info.      1.00      1.00       1.00        1.00
 Doc. Struct.       0.50      0.67       0.92        0.75
 Text Interp.       0.80      1.00       0.62        0.76

 Table              1.00      1.00       0.80        0.90
 Photo              0.50      1.00         -         0.75
 Figure             0.50      1.00         -         0.75
 Math Exp.          1.00      1.00         -         1.00
 Graph              1.00        -          -         1.00

Table 5
Results (accuracy) on “trap” questions. The overall accuracy on the dataset is 80.25%.

                     Tech. Docs        Reg. Docs
 Content          Railway     Pool      Railway      Avg.
 Biblio. Info.        -         -          -           -
 Doc. Struct.         -         -        1.00        1.00
 Text Interp.       0.50      1.00       0.71        0.71

 Table              0.00      1.00       1.00        0.75
 Photo              1.00      1.00         -         1.00
 Figure             0.00      1.00         -         0.50
 Math Exp.          0.00      1.00         -         0.50
 Graph              1.00        -          -         1.00

3.1. Discussion

Results allow us to draw the following conclusions regarding GPT-4omni’s ability to understand
textual and visual content for each question category.

Bibliographic Information. A perfect score for both technical and regulatory documents
indicates that the model consistently retrieved bibliographic information (author, title, date)
accurately.

Document Structure. GPT-4omni is not perfect at detecting the structure of the documents. For
example, the model sometimes includes invented entries or omits the entire index of the
technical railway document. This could be attributed to the document’s complexity: it contains
lengthy table labels (e.g., Table 12.8.2.1-1), a large number of figures and tables (51), some
of them without captions, and a high page count (96). We observe that the model is highly
sensitive to the prompts used. For instance, when prompted with:

       Report the number of tables present in
       the document
   for a regulatory document, the model inaccurately returns a result of just one table. In
contrast, when we refined the prompt as:

       Identify all the tables present in the
       following document. For each table
       found, provide the page number where
       it is located and the total number of
       tables in the document

   the model accurately lists the tables along with their corresponding pages and correctly
identifies six tables. As for the pool document, the model did not extract the exact number of
pages, likely due to the absence of page numbers.

Text Interpretation. The model performs better on the pool document than on the railway
documents in text interpretation. In particular, GPT-4omni makes a mistake in a paragraph-level
“trap” question. When asked about the height of the cable car pylon mentioned in paragraph
12.6, the model incorrectly claims it is 43 meters tall, despite neither the paragraph nor the
entire document containing any references to cable cars. As in the previous case, we found that
the model is highly sensitive to prompt phrasing. For example, when asked to:

       Riassumi il contenuto del paragrafo
       II.12 PROCESSO DI CONDIVISIONE
       DELLE INDAGINI
       [English: Summarize the content of paragraph II.12 PROCESSO DI CONDIVISIONE DELLE
       INDAGINI (investigation sharing process)]

   the model provides a somewhat brief and general response. However, when the prompt was made
more specific, such as:

       Riassumi il paragrafo II.12 ’PROCESSO DI CONDIVISIONE DELLE
       INDAGINI’ del documento seguente
       utilizzando un linguaggio tecnico.
       Includi tutte le informazioni pertinenti
       e fornisci un livello di dettaglio
       approfondito. Indica chiaramente
       eventuali riferimenti a documenti e
       procedure pertinenti
       [English: Summarize paragraph II.12 'PROCESSO DI CONDIVISIONE DELLE INDAGINI' of the
       following document using technical language. Include all relevant information and
       provide an in-depth level of detail. Clearly indicate any references to relevant
       documents and procedures]

   the model produces a much more accurate and detailed summary.

Tables. As for interpreting table content, GPT-4omni performs well in both document types.
However, in the railway document, the model falls into the “trap” question by attempting to
answer the query about the value in the fifth column of a table with only four columns. When
prompted with:

       Quale è il valore più alto nella quinta
       colonna della Tabella 12.8.1-1?
       [English: What is the highest value in the fifth column of Table 12.8.1-1?]

   the model produced:

       Nella quinta colonna della Tabella 12.8.1-1,
       che rappresenta le tolleranze, il valore più
       alto è ± 20% per la resistenza a rottura per
       trazione su provino longitudinale e trasversale,
       e per la stabilità di forma a caldo
       [English: In the fifth column of Table 12.8.1-1, which represents the tolerances, the
       highest value is ± 20% for the tensile breaking strength on longitudinal and transverse
       specimens, and for the hot shape stability]

   despite the absence of a fifth column. The model’s answer was so detailed that, without
verifying the document, even a human might find it difficult to recognize that the response had
no basis in the actual content.

Photos. GPT-4omni can describe objects in photos of documents. However, in the railway
technical document, the model missed crucial details, like a green tube and railway track in
Figure 3, which affected its ability to answer questions about their relative position.
Interestingly, when provided only with the relevant page, the model correctly identified all
objects, including key elements, suggesting that the document context or photo placement may
influence its comprehension of images.

Figures. In the railway document, GPT-4omni fell into a “trap” question and additionally failed
to accurately describe the content of Figure 1. Regarding the “trap” question:

       Cosa rappresenta l’oggetto di colore
       rosso presente nella figura?
       [English: What does the red object in the figure represent?]

   the model identified a red object that was not present in the figure and even provided a
detailed description of it:

       L’oggetto di colore rosso presente nella
       figura rappresenta il "rivestimento impermeabile"
       dello strato di colore rosso, come
       indicato nella legenda e nel contesto delle
       spiegazioni tecniche fornite nella sezione
       pertinente del capitolato tecnico, in particolare
       nelle fasi di impermeabilizzazione
       straordinaria e ripresa del rivestimento impermeabile
       [English: The red object in the figure represents the "waterproof coating" of the red
       layer, as indicated in the legend and in the context of the technical explanations
       provided in the relevant section of the technical specification, in particular in the
       phases of extraordinary waterproofing and restoration of the waterproof coating]

   As for the image description:

       Descrivi il contenuto della figura
       12.8.4.2.5.c. Nella figura 12.8.4.2.5.c
       dove va posizionato il bocchettone in
       HDPN?
       [English: Describe the content of figure 12.8.4.2.5.c. In figure 12.8.4.2.5.c, where
       should the HDPN outlet be positioned?]

   GPT-4omni initially states that it cannot describe the
                                                             image as it does not exist:
       La figura 12.8.4.2.5.c non è visibile nel documento condiviso, quindi non posso fornire una descrizione dei suoi contenuti specifici
       [English: Figure 12.8.4.2.5.c is not visible in the shared document, so I cannot provide a description of its specific contents]

However, in the subsequent question about the placement of the nozzle, the model correctly described the nozzle placement:

       Nella figura 12.8.4.2.5.c, il bocchettone in HDPM va posizionato in corrispondenza dei fori di scarico, come indicato nella figura stessa
       [English: In figure 12.8.4.2.5.c, the HDPM nozzle should be positioned at the drain holes, as indicated in the figure itself]

Math Expressions. GPT-4omni demonstrates good performance in interpreting mathematical expressions in technical documents. However, in the railway document, the model made a mistake on the “trap” question asking about multiplication:

       Come si interpreta il prodotto che è presente nell’espressione matematica?
       [English: How is the product in the mathematical expression to be interpreted?]

in a formula that did not have any multiplication:

       Il prodotto presente nell’espressione matematica 11<n<40 non rappresenta un’operazione di moltiplicazione, ma indica semplicemente che la variabile n deve rispettare entrambi i limiti indicati
       [English: The product in the mathematical expression 11<n<40 does not represent a multiplication operation, but simply indicates that the variable n must satisfy both of the stated limits]

This suggests that the model might have misinterpreted the word “product” in the mathematical context.

Graphs. The results table shows a perfect score for the railway document in interpreting graphs. There is no data for the other documents.

This study suggests several practical applications of LLMs in various sectors. Automating Compliance Checks for Construction Projects: LLMs can help construction companies review technical documents for safety regulations and building codes. By analyzing specifications, the model can identify parts that may comply with or violate local laws. While this can make compliance easier, human experts must verify the model’s findings because LLMs can make errors or generate false information. Identifying Conflicting Procedures in Large Document Archives: Organizations with extensive procedural document archives can use LLMs to find inconsistencies or conflicts between procedures. The model can scan large amounts of text and highlight contradictions, providing a basis for human review. This helps companies resolve discrepancies efficiently.

4. Conclusion

We showed that GPT-4omni has a high potential for analyzing technical and regulatory documents. However, the model tends to make factual errors, to generate inaccurate details and to provide misleading answers supported by technical explanations. These observations highlight potential limitations when handling long and complex documents, and further research is needed to better understand and address these challenges. Our study has some limitations that should be considered.
   Limited Sample Size. The evaluation was based on a dataset of four documents, which may not be representative of the broader range of technical documents.
   Query Format. We employed a multi-question prompt format, grouping multiple questions within a single prompt. We plan to explore an approach where each question is presented as an individual prompt.
   Examining Positional Bias. There is a possibility that the answer location within the document (beginning, middle, or end) might affect the model’s performance.
   Contextual Sensitivity Analysis. The amount of context provided could influence GPT-4omni’s answers to questions related to specific document elements. We plan to systematically compare the model’s accuracy when presented with the entire document versus just the relevant page containing the answer.
   Playground vs. API Analysis. We primarily used the OpenAI API for evaluation. It would be valuable to explore whether analyzing documents through OpenAI’s Playground interface yields similar results.

Acknowledgments

This work has been partially supported by the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU.

References

[1] Y. Xiao, W. Y. Wang, On hallucination and predictive uncertainty in conditional language generation, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 2734–2744. URL: https://aclanthology.org/2021.eacl-main.236. doi:10.18653/v1/2021.eacl-main.236.
[2] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, K. Saenko, Object hallucination in image captioning, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 4035–4045. URL: https://aclanthology.org/D18-1437. doi:10.18653/v1/D18-1437.
[3] H. Yan, M. Ma, Y. Wu, H. Fan, C. Dong, Overview and analysis of the text mining applications in the construction industry, Heliyon 8 (2022) e12088. URL: https://www.sciencedirect.com/science/article/pii/S240584402203376X. doi:10.1016/j.heliyon.2022.e12088.
[4] Y. Ding, J. Ma, X. Luo, Applications of natural language processing in construction, Automation in Construction 136 (2022) 104169. URL: https://www.sciencedirect.com/science/article/pii/S0926580522000425. doi:10.1016/j.autcon.2022.104169.
[5] A. Shamshiri, K. R. Ryu, J. Y. Park, Text mining and natural language processing in construction, Automation in Construction 158 (2024) 105200. URL: https://www.sciencedirect.com/science/article/pii/S0926580523004600. doi:10.1016/j.autcon.2023.105200.
[6] A. Erfani, Q. Cui, Natural language processing application in construction domain: An integrative review and algorithms comparison, 2022, pp. 26–33. doi:10.1061/9780784483893.004.
[7] OpenAI, GPT-4 technical report, 2024. URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
[8] A. Annicchiarico, Capitolato - parte ii - sezione 12 - ponti, viadotti, sottovia e cavalcavia images, Pubblica Amministrazione, 2020. URL: https://condivisionext.rfi.it/mimse/Documenti%20condivisi/PFTE%20Velocizzazione%20Roma-Pescara%20-%20Lotto%201%20-%20Interporto-Manoppello/Riscontro%20osservazioni%20Comitato%20Speciale%20CSLLPP/Integrazione%20documentale/1_Capitolato%20generale%20tecnico%20OOCC/Capitolato%20-%20Parte%20II%20-%20Sezione%2012%20-%20Ponti,%20Viadotti,%20Sottovia%20e%20Cavalcavia.pdf, accessed: July 18, 2024.
[9] R. Luciano, Riqualificazione punto natatorio, Comune di Lavis, 2016. URL: https://apl.provincia.tn.it/content/download/12939/230226/version/1/file/Riqualificazione+punto+natatorio.pdf, accessed: July 18, 2024.
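
The contextual sensitivity comparison proposed in the Conclusion (entire document versus only the relevant page) could be scored along the following lines. This is a minimal sketch under stated assumptions, not the evaluation code used in the study: the function name `accuracy_by_context`, the record layout `(element, context, is_correct)`, and the context labels `"full_document"` and `"relevant_page"` are all hypothetical, and the toy gradings are made up solely to exercise the function.

```python
from collections import defaultdict


def accuracy_by_context(records):
    """Return {(element, context): fraction of correct answers}.

    Each record is a hypothetical tuple (element, context, is_correct):
      element    - document element type, e.g. "photo" or "table"
      context    - "full_document" or "relevant_page" (assumed labels)
      is_correct - True if a human grader judged the answer correct
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for element, context, is_correct in records:
        key = (element, context)
        totals[key] += 1
        correct[key] += int(is_correct)
    return {key: correct[key] / totals[key] for key in totals}


# Toy, made-up gradings (NOT results from the study).
toy_records = [
    ("photo", "full_document", False),
    ("photo", "full_document", True),
    ("photo", "relevant_page", True),
    ("photo", "relevant_page", True),
]

scores = accuracy_by_context(toy_records)
for (element, context), value in sorted(scores.items()):
    print(f"{element:6s} {context:14s} accuracy={value:.2f}")
```

Keying the scores by element type as well as context would make it possible to check whether an effect such as the one observed for photos also appears for tables, figures, or mathematical expressions.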