=Paper= {{Paper |id=Vol-3878/63_main_long |storemode=property |title=Understanding High-complexity Technical Documents with State-of-Art Models |pdfUrl=https://ceur-ws.org/Vol-3878/63_main_long.pdf |volume=Vol-3878 |authors=Bernardo Magnini,Roberto Zanoli |dblpUrl=https://dblp.org/rec/conf/clic-it/MagniniZ24 }} ==Understanding High-complexity Technical Documents with State-of-Art Models== https://ceur-ws.org/Vol-3878/63_main_long.pdf

Understanding High-complexity Technical and Regulatory
Documents with State-of-the-Art Models: A Pilot Study
Bernardo Magnini1,**,‡ , Alessandro Dal Pozzo2 and Roberto Zanoli1
1 Fondazione Bruno Kessler, Trento, Italy
2 Rete Ferroviaria Italiana S.p.A, Italy

Abstract
We explore the potential of state-of-the-art Large Language Models (LLMs) to reason on the content of high-complexity
documents written in Italian. We focus on both technical documents (e.g., describing civil engineering works) and regulatory
documents (e.g., describing procedures). While civil engineering documents contain crucial information that supports
critical decision-making in construction, transportation and infrastructure projects, procedural documents outline essential
guidelines and protocols that ensure efficient operations, adherence to safety standards and effective incident management.
Although LLMs offer a promising solution for automating the extraction and comprehension of high-complexity documents,
potentially transforming our interaction with technical information, LLMs may encounter significant challenges when
processing such documents due to their complex structure, specialized terminology and strong reliance on graphical and
visual elements. Moreover, LLMs are known to sometimes produce unexpected or incorrect analyses, a phenomenon referred
to as hallucination. The goal of the paper is to conduct an assessment of LLM capacities along several dimensions, including
the format of the document (i.e., selectable text PDFs versus scanned OCR PDFs), the structure of the documents (e.g., number
of pages, date of the document), the graphical elements (e.g., tables, graphs, photos), the interpretation of text portions (e.g.,
make a summary), and the need for external knowledge (e.g., to interpret a mathematical expression). To run the assessment,
we took advantage of GPT-4omni, a large multi-modal model pre-trained on a variety of different data. Our findings suggest
that there is great potential for real-world applications for high-complexity documents, although LLMs may still be susceptible
to producing misleading information.

Keywords
LLMs, GPT-4omni, Information extraction, Technical documents, Procedural documents, Civil engineering

1. Introduction

Technical documents employed in civil engineering contain information essential for planning, designing and constructing structures that need to ensure safety and compliance with regulations. As an example, such high-complexity documents provide technical guidelines for managing the development of roads, bridges and other transport networks. Additionally, these documents are fundamental for public infrastructure projects, ensuring they serve the community effectively and safely. These documents are highly complex, particularly due to their multi-modal nature, where textual content is mixed with several kinds of graphical content. The written content can vary from simple explanations to very detailed technical instructions, often referring to specialized regulations. The visual elements typically include tables with numbers, math formulas and detailed engineering drawings, as well as photos of natural environments and renderings of a construction once realized. In addition, documents are available either in PDF format as scanned documents, or as PDFs processed with Optical Character Recognition (OCR) software, introducing an additional layer of complexity due to potential variations in text recognition quality. Finally, civil engineering technical documents are typically long, easily reaching hundreds of pages. Figure 1 shows one of the many visual elements occurring in the technical documents (civil engineering projects in Italian) considered in this study.

Figure 1: Figure showing drainage outlets used at the junction points between the bituminous membrane and the rainwater downpipe.

Similarly to technical documents, regulatory documents play an equally important role across the same sectors, as they outline the steps for managing incidents, supervising safety procedures and ensuring regulatory compliance. For example, railway procedural documents contain comprehensive instructions on handling incidents and supervising safety measures, introducing additional complexity through procedural frameworks. Although procedural documents lack the visual complexity typical of technical projects, such as the presence of figures, tables and graphs, they are dense with text, focusing on legal and procedural details.

The paper investigates how state-of-the-art generative models are able to reason on the content of high-complexity technical and regulatory documents written in Italian. As generative models, both LLMs and Large Multimodal Models (LMMs), are rapidly becoming more and more powerful, our research questions aim at assessing their ability to extract and interpret key information, thus reducing the need for manual reviews by human experts. To this end, we have defined a simple question-answer evaluation framework tailored to technical and regulatory documents. As an example, we ask the model questions such as “Provide a general summary of the technical specifications in the document” and then we manually check the model answer. We also consider the potential for LLMs/LMMs to generate content that is not grounded in the document, an issue often referred to as model confabulations or hallucinations [1, 2]. To assess confabulations we included “trap” questions mentioning non-existing objects in the document. Finally, the assessment considers both selectable text PDFs, which are extractable and editable, and scanned OCR PDFs, where text is derived from scanning or from OCR.

A state-of-the-art survey of articles published between 2000 and 2021, focusing on the applications of Text Mining in the construction industry, was presented in [3]. The works in [4] and [5] explored NLP application and development in construction. Various machine learning and deep learning-based NLP techniques, and their applications in construction research, are documented in [6].

There are several potential real-world applications of LLMs in supporting and enhancing various sectors. Construction firms can exploit LLMs to assist in reviewing technical documents for safety regulations and building codes, helping to simplify compliance checks. Additionally, organizations with large document archives can leverage LLMs to identify potential inconsistencies or conflicts in procedures, providing valuable insights for further human review and ensuring adherence to unified operational protocols.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
magnini@fbk.eu (B. Magnini); a.dalpozzo@rfi.it (A. Dal Pozzo); zanoli@fbk.eu (R. Zanoli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

2. Assessment Framework

We defined a series of questions to assess the model’s proficiency in interpreting written text and visual content, including images and graphs. Table 1 lists queries designed to evaluate how well the model understands textual content, assessing its performance across categories like “Bibliographic Information”, “Document Structure” and “Text Interpretation”. Similarly, Table 2 presents the list of queries aimed at assessing the model’s ability to interpret graphical content, including “Table”, “Photo”, “Figure”, “Mathematical Expression” and “Graph”.

Additionally, we investigated the potential for the model to experience hallucinations by posing “trap” questions designed to induce incorrect responses. For example, a question such as “How tall is the pylon of the Zambana Vecchia-Fai della Paganella cableway mentioned in paragraph 12.6?” was posed, even though neither the specified paragraph nor the whole document contains any information about cableways. Other instances include queries like “What is the highest value in the fifth column of Table 12.8.1-1?”, despite the specified table having only 4 columns. Trap questions are highlighted in bold in the tables.

Human evaluators subsequently reviewed and analyzed all responses provided by the model. Each response generated by the model was evaluated based on the following scoring:

• 2 points for fully accurate responses: the answer meets the prompt’s requirements completely, such as providing a full list of figures or a comprehensive summary of the document’s key content.
• 1 point for partially correct responses: the answer is incomplete, such as a list of figures missing some entries or a summary that covers some important points but omits others.
• 0 points for incorrect responses: the answer fails to meet requirements, such as a mostly incomplete or missing list of figures or a summary that does not accurately match the document’s content.

2.1. Model

For our experiments we use GPT-4omni [7], available from OpenAI since May 2024, which represents a significant advance in AI innovation by becoming the first truly multimodal model capable of interpreting and generating various types of data, including text, images and audio.

2.2. Dataset

The dataset for our pilot experiments includes four high-complexity documents: two are technical specifications and two are regulatory documents. More specifically:
Table 1
Questions (in Italian) used to test the model’s capacity to reason on textual content. “Trap” questions are highlighted in bold.

Content / Question

1. Bibliographic Information
   Estrai il nome completo degli autori del documento. Estrai il titolo completo del documento. Estrai la data di pubblicazione del documento.

2. Document Structure
   Riporta l’esatto numero di pagine del documento. Riporta l’indice delle tabelle presenti nel documento. Riporta l’indice delle figure presenti nel documento.

3. Text Interpretation
   Documento: Fai un riassunto generale del capitolato tecnico. Quali normative e regolamenti devono essere rispettati secondo il capitolato tecnico? Qual è la timeline del progetto come delineata nel capitolato tecnico? Qual è la lunghezza della fune portante della funivia descritta nel capitolato tecnico?
   Paragrafo: Riassumi il paragrafo II.12 PROCESSO DI CONDIVISIONE DELLE INDAGINI del documento seguente utilizzando un linguaggio tecnico. Includi tutte le informazioni pertinenti e fornisci un livello di dettaglio approfondito. Indica chiaramente eventuali riferimenti a documenti e procedure pertinenti. Come sono suddivise le attività di manutenzione ordinaria?

Table 2
Questions (in Italian) used to test the model’s capacity to reason on pictures, graphs and tables. “Trap” questions are in bold.

Content / Question

4. Table
   Qual è il valore richiesto della resistenza a rottura per trazione su un provino longitudinale per la membrana inferiore da 4 mm? Cosa rappresenta la Tabella 12.8.1-2? Quali caratteristiche della membrana sono riportate nella Tabella 12.8.1-1 rispetto alla Tabella 12.8.1-2? Quale è il valore più alto nella quinta colonna della Tabella 12.8.1-1?
   Per quante tipologie di eventi di cui alla tabella allegato 9 è previsto l’invio dell’Avviso di Accadimento (AA)?

5. Photo
   Descrivi gli oggetti o le persone presenti nella figura 12.8.4.2.6.a? Il tubo verde nella figura passa sopra oppure sotto alla rotaia? Quanti alberi ci sono nella figura?

6. Figure
   Descrivi il contenuto della figura 12.8.4.2.5.c. Nella figura 12.8.4.2.5.c dove va posizionato il bocchettone in HDPN? Cosa rappresenta l’oggetto di colore rosso presente nella figura?

7. Mathematical Expression
   Descrivi a cosa fa riferimento l’espressione matematica 11 ≤ n ≤ 40 riportata nella tabella Tabella 12.14.3.7. Cosa significa il simbolo ≤ nell’espressione matematica? Come si interpreta il prodotto che è presente nell’espressione matematica?

8. Graph
   Cosa è rappresentato nel grafico di figura 1? Cosa rappresenta l’asse delle X e l’asse delle Y del grafico? Quale unità di misura è utilizzata per esprimere i valori sull’asse delle Y? A quale valore della curva del grafico corrisponde il valore 100 delle X?
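
The answers to the questions in Tables 1 and 2 are scored on the 0/1/2 point scale of Section 2, while the results in Section 3 are reported as accuracies between 0 and 1. The paper does not spell out the conversion, so the following is only a minimal sketch under two assumptions: that accuracy is earned points divided by the maximum obtainable points, and that the overall figure is the mean of the textual and visual group means. The function names are hypothetical. The second assumption is at least consistent with the reported numbers: applied to the “Avg.” columns of Tables 4 and 5, it yields roughly 85.83% and 80.25%.

```python
from statistics import mean

MAX_POINTS = 2  # top score on the 0/1/2 scale described in Section 2


def normalized_accuracy(points):
    """Map per-question scores (each 0, 1 or 2) to a value in [0, 1].

    Assumption: accuracy = earned points / maximum obtainable points.
    """
    if not points:
        raise ValueError("at least one scored answer is required")
    if any(p not in (0, 1, 2) for p in points):
        raise ValueError("scores must be 0, 1 or 2")
    return sum(points) / (MAX_POINTS * len(points))


def overall_accuracy(textual, visual):
    """Mean of the textual and visual group means (assumed aggregation)."""
    return mean([mean(textual), mean(visual)])


# "Avg." column of Table 4 (regular questions), copied from the paper.
REGULAR_TEXTUAL = [1.00, 0.75, 0.76]               # Biblio., Doc. Struct., Text Interp.
REGULAR_VISUAL = [0.90, 0.75, 0.75, 1.00, 1.00]    # Table, Photo, Figure, Math, Graph

# "Avg." column of Table 5 ("trap" questions); Biblio. Info. has no entry there.
TRAP_TEXTUAL = [1.00, 0.71]                        # Doc. Struct., Text Interp.
TRAP_VISUAL = [0.75, 1.00, 0.50, 0.50, 1.00]       # Table, Photo, Figure, Math, Graph

if __name__ == "__main__":
    # One fully correct and two partially correct answers out of three.
    print(round(normalized_accuracy([2, 1, 1]), 2))
    print(round(overall_accuracy(REGULAR_TEXTUAL, REGULAR_VISUAL) * 100, 2))
    print(round(overall_accuracy(TRAP_TEXTUAL, TRAP_VISUAL) * 100, 2))
```

Averaging group means rather than all eight categories gives textual and visual content equal weight even though there are fewer textual categories; whether the authors intended this weighting is not stated.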

• A 96-page technical specification document for civil engineering works from the Italian railways [8].
• A 32-page document on the design of an outdoor swimming pool in Trentino-Alto Adige [9].
• A 49-page regulatory document from RFI outlining procedures for investigating railway incidents.
• A 12-page regulatory document from RFI focusing on managing prescriptions and supervising activities by ANSFISA (Agenzia Nazionale per la Sicurezza Ferroviaria).

The two technical documents are licensed for unrestricted use in non-commercial, educational, or research contexts. In contrast, the two procedural documents related to the Italian railway system are intended only for internal RFI use and cannot be distributed.

As for the content of the four documents, the first page provides general (bibliographic) information about the document, including publication date and authors. An example is reported in Figure 2.

Figure 2: Each document’s first page contains bibliographic information.

Furthermore, the documents contain a combination of photos, figures and tables, exemplified by Figures 3, 1 and 4, respectively. These visual elements are important for explaining technical details and the logical structure of procedures, often substituting written descriptions. This means that the model frequently needs to interpret these visual elements without relying on explanations provided in the text.

Table 3
Statistics on the documents used for assessment.

            Tech. Docs          Reg. Docs
Content     Railway   Pool      Railway   Railway
Pages       96        32        49        12
Tables      20        4         14        0
Photo       2         2         0         0
Figure      31        19        2         0
Graph       2         0         0         0

Figure 3: Photo showing a worker applying the waterproof membrane.

Figure 4: Excerpt of the table reporting the characteristics of the 4mm lower membrane.

Figure 5: Formula representing the number of constraint mechanisms (restraints) required to be tested according to the specifications outlined in the chapter.

Figure 6: Graphic representing melting of the stiffness of elastic devices of bearing devices.

An important feature of our dataset is that it includes both selectable PDFs and scanned OCR PDFs. More specifically, the three RFI documents are selectable text PDFs, where the text is digital, searchable and can be copied, typically created by word processors or digital publishing software. These documents contain pages with tables and figures, with some tables spanning multiple pages and others presented as images. Certain figures and tables include captions, while others do not. The documents also include formulas and graphics, such as those in Figures 5 and 6. On the other hand, the swimming pool document is a scanned OCR PDF, which is not directly selectable and searchable. Some pages in this document are misaligned compared to the standard orientation, and it also includes tables and figures across the document. Table 3 shows a comparison of the key characteristics of these documents.

2.3. Contamination Test

We ran a contamination test to verify that GPT-4omni did not use the documents of our dataset in its pre-training. The test was carried out on the two publicly available technical documents, while for the regulatory documents, which are internal to RFI, it was not necessary. For the contamination test, we masked document elements, such as numbers and paragraph identifiers in the text, and asked the model to fill in these gaps. For instance, we prompted the model with tasks like “Replace the MASK marker with the missing paragraph number in the following text”. Results indicate that the model was unable to identify the missing words, suggesting that it is unlikely to have encountered these documents in the pre-training phase. Moreover, even if prior exposure to the documents could improve GPT’s performance, its unfamiliarity with the specific questions and answers should limit its accuracy in responding.
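
The masking step of the contamination test can be illustrated with a short sketch. It is not the authors’ code: the regular expression for paragraph identifiers, the function names and the example passage are all assumptions made for illustration, and only the instruction text is taken from the paper. The idea is that a model which had memorized a document would be expected to restore a hidden identifier more often than one that had never seen it.

```python
import re

# Dotted identifiers such as "12.8.4" or "12.8.4.2" (assumed pattern).
PARAGRAPH_ID = re.compile(r"\b\d+(?:\.\d+)+\b")
MASK = "MASK"
# Instruction wording as quoted in the paper.
INSTRUCTION = (
    "Replace the MASK marker with the missing paragraph number "
    "in the following text"
)


def mask_first_identifier(text):
    """Hide the first paragraph identifier; return (masked_text, hidden_value)."""
    match = PARAGRAPH_ID.search(text)
    if match is None:
        raise ValueError("no paragraph identifier found")
    masked = text[: match.start()] + MASK + text[match.end():]
    return masked, match.group(0)


def build_prompt(masked_text):
    """Combine the instruction with the masked passage."""
    return f"{INSTRUCTION}:\n\n{masked_text}"


def is_recovered(model_answer, hidden_value):
    """True if the answer contains exactly the hidden identifier."""
    return hidden_value in PARAGRAPH_ID.findall(model_answer)


if __name__ == "__main__":
    # Hypothetical passage, not taken from the dataset.
    passage = "As specified in paragraph 12.8.4, the membrane is 4 mm thick."
    masked, hidden = mask_first_identifier(passage)
    print(build_prompt(masked))
    print(is_recovered("The missing number is 12.8.4.", hidden))
```

Matching whole identifiers in the answer, rather than testing for a substring, avoids counting a longer identifier such as 12.8.4.1 as a correct recovery of 12.8.4.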
2.4. Experimental Setup

There are two modalities to query GPT-4omni: using the OpenAI playground or the OpenAI API. We used the API because it allows for quickly scaling from analyzing a few documents to tens or thousands automatically, whereas with the playground documents must be uploaded manually one at a time. We used OpenAI API version 1.34.0 in conjunction with GPT-4omni version gpt-4o-2024-05-13. Since GPT-4omni is not deterministic, even with temperature set to 0, we kept all default parameters of the model.

The PDF documents were first converted, using the free online tool PDF24, into images, as PDF format inputs are not currently supported by the GPT-4omni API. This contrasts with the playground, where PDF uploads are allowed. Each document’s page was transformed into an image, using the PNG format and setting the resolution to 300 DPI to ensure high-quality reproduction of the original document pages. For each document, the images were then uploaded through the OpenAI API in the exact sequence of their respective pages. Regarding the prompt used for querying the model, we used the following: “Rispondi alla seguente domanda basandoti sul capitolato tecnico fornito, senza usare alcuna conoscenza preliminare.”

We tested GPT-4omni’s non-deterministic behavior by making five requests per question set, using the shorter swimming pool document (32 pages), to avoid potential server time-outs. For each set of questions, we assessed how consistent the answers are with each other on a scale from 0 (inconsistent) to 1 (consistent). The average consistency score across 8 question sets was 0.85.

As of writing time (June 2024), the cost of processing one prompt for one document in our dataset using the OpenAI API is approximately $0.50. Processing time also needs to be considered. For instance, querying GPT-4omni for the longer document (96 pages) takes an average of 3 minutes and 20 seconds.

3. Results and Discussion

GPT-4omni achieves an average accuracy of 83.66% on textual content and 88.00% on visual content, resulting in an overall accuracy of 85.83%. However, accuracy drops to 80.25% when presented with questions specifically designed to induce errors (“trap” questions). GPT-4omni’s scores for both textual content and graphical elements, ranging from 0 (indicating no accuracy) to 1 (indicating perfect accuracy), are provided separately for regular questions (Table 4) and for “trap” questions (Table 5).

Table 4
Results (accuracy) on regular questions. The overall accuracy on the dataset is 85.83%.

                Tech. Docs          Reg. Docs
Content         Railway   Pool      Railway     Avg.
Biblio. Info.   1.00      1.00      1.00        1.00
Doc. Struct.    0.50      0.67      0.92        0.75
Text Interp.    0.80      1.00      0.62        0.76
Table           1.00      1.00      0.80        0.90
Photo           0.50      1.00      -           0.75
Figure          0.50      1.00      -           0.75
Math Exp.       1.00      1.00      -           1.00
Graph           1.00      -         -           1.00

Table 5
Results (accuracy) on “trap” questions. The overall accuracy on the dataset is 80.25%.

                Tech. Docs          Reg. Docs
Content         Railway   Pool      Railway     Avg.
Biblio. Info.   -         -         -           -
Doc. Struct.    -         -         1.00        1.00
Text Interp.    0.50      1.00      0.71        0.71
Table           0.00      1.00      1.00        0.75
Photo           1.00      1.00      -           1.00
Figure          0.00      1.00      -           0.50
Math Exp.       0.00      1.00      -           0.50
Graph           1.00      -         -           1.00

3.1. Discussion

Results allow us to draw the following conclusions regarding GPT-4omni’s ability to understand textual and visual content for each question category.

Bibliographic Information. A perfect score for both technical and regulatory documents indicates that the model consistently retrieved bibliographic information (author, title, date) accurately.

Document Structure. GPT-4omni is not perfect at detecting the structure of the documents. For example, the model sometimes includes invented entries or omits the entire index of the technical railway document. This could be attributed to the document’s complexity, containing lengthy table labels (e.g., Table 12.8.2.1-1), a large number of figures and tables (51), the absence of captions for some of them, and a high page count (96). We observe that the model is highly sensitive to the prompts used. For instance, when prompted with:

    Report the number of tables present in the document

for a regulatory document, the model inaccurately returns a result of just one table. In contrast, when we refined the prompt as:

    Identify all the tables present in the following document. For each table found, provide the page number where it is located and the total number of tables in the document

the model accurately lists the tables along with their corresponding pages and correctly identifies six tables. As for the pool document, the model did not extract the exact number of pages, likely due to the absence of page numbers.

Text Interpretation. The model performs better on the pool document than on the railway documents in text interpretation. In particular, GPT-4omni makes a mistake in a paragraph-level “trap” question. When asked about the height of the cable car pylon mentioned in paragraph 12.6, the model incorrectly claims it is 43 meters tall, despite neither the paragraph nor the entire document containing any references to cable cars. As in the previous case, we found that the model is highly sensitive to prompt phrasing. For example, when asked to:

    Riassumi il contenuto del paragrafo II.12 PROCESSO DI CONDIVISIONE DELLE INDAGINI

the model provides a somewhat brief and general response. However, when the prompt was made more specific, such as:

    Riassumi il paragrafo II.12 ’PROCESSO DI CONDIVISIONE DELLE INDAGINI’ del documento seguente utilizzando un linguaggio tecnico. Includi tutte le informazioni pertinenti e fornisci un livello di dettaglio approfondito. Indica chiaramente eventuali riferimenti a documenti e procedure pertinenti

the model produces a much more accurate and detailed summary.

Tables. As for interpreting table content, GPT-4omni performs well in both document types. However, in the railway document, the model falls into the “trap” question by attempting to answer the query about the value in the fifth column of a table with only four columns. When prompted with:

    Quale è il valore più alto nella quinta colonna della Tabella 12.8.1-1?

the model produced:

    Nella quinta colonna della Tabella 12.8.1-1, che rappresenta le tolleranze, il valore più alto è ± 20% per la resistenza a rottura per trazione su provino longitudinale e trasversale, e per la stabilità di forma a caldo

despite the absence of a fifth column. The model’s answer was so detailed that, without verifying the document, even a human might find it difficult to recognize that the response had no basis in the actual content.

Photos. GPT-4omni can describe objects in photos of documents. However, in the railway technical document, the model missed crucial details, like a green tube and railway track in Figure 3, which affected its ability to answer questions about their relative position. Interestingly, when provided only with the relevant page, the model correctly identified all objects, including key elements, suggesting that the document context or photo placement may influence its comprehension of images.

Figures. In the railway document, GPT-4omni fell into a “trap” question and additionally failed to accurately describe the content of Figure 1. Regarding the “trap” question:

    Cosa rappresenta l’oggetto di colore rosso presente nella figura?

the model identified a red object that was not present in the figure and even provided a detailed description of it:

    L’oggetto di colore rosso presente nella figura rappresenta il "rivestimento impermeabile" dello strato di colore rosso, come indicato nella legenda e nel contesto delle spiegazioni tecniche fornite nella sezione pertinente del capitolato tecnico, in particolare nelle fasi di impermeabilizzazione straordinaria e ripresa del rivestimento impermeabile

As for the image description:

    Descrivi il contenuto della figura 12.8.4.2.5.c. Nella figura 12.8.4.2.5.c dove va posizionato il bocchettone in HDPN?

GPT-4omni initially states that it cannot describe the image as it does not exist:
    La figura 12.8.4.2.5.c non è visibile nel documento condiviso, quindi non posso fornire una descrizione dei suoi contenuti specifici

However, in the subsequent question about the placement of the nozzle, the model correctly described the nozzle placement:

    Nella figura 12.8.4.2.5.c, il bocchettone in HDPM va posizionato in corrispondenza dei fori di scarico, come indicato nella figura stessa

Math Expressions. GPT-4omni demonstrates good performance in interpreting mathematical expressions in technical documents. However, in the railway document, the model made a mistake on the “trap” question asking about multiplication:

    Come si interpreta il prodotto che è presente nell’espressione matematica?

in a formula that did not have any multiplication:

    Il prodotto presente nell’espressione matematica 11

4. Conclusion

We showed that GPT-4omni has a high potential for analyzing technical and regulatory documents. However, the model tends to make factual errors, to generate inaccurate details and to provide misleading answers supported by technical explanations. These observations highlight potential limitations when handling long and complex documents, and further research is needed to better understand and address these challenges. Our study has some limitations that should be considered.

Limited Sample Size. The evaluation was based on a dataset of four documents, which may not be representative of the broader range of technical documents.

Query Format. We employed a multi-question prompt format, grouping multiple questions within a single prompt. We plan to explore an approach where each question is presented as an individual prompt.

Examining Positional Bias. There is a possibility that the answer location within the document (beginning, middle, or end) might affect the model’s performance.

Contextual Sensitivity Analysis. The amount of context provided could influence GPT in answering questions related to specific document elements. We plan to systematically compare the model accuracy when presented