
Exploring Large Language Models for Procedural Extraction from Documents

Anisa Rula1,*, Jennifer D’Souza2
1 University of Brescia, Brescia, Italy
2 TIB Leibniz Information Centre for Science and Technology, Hannover, Germany


Abstract
Recent advancements in Natural Language Processing (NLP), notably the emergence of extensive language models pre-trained on vast datasets, are opening new avenues in Knowledge Engineering. This study delves into the utilization of these large language models (LLMs) in two learning scenarios - zero-shot and in-context learning - to address the extraction of procedures from unstructured PDF texts through incremental question-answering techniques. Specifically, we employ the cutting-edge GPT-4 (Generative Pre-trained Transformer 4) model, alongside two variations of in-context learning methodologies. These methods incorporate an ontology with definitions of procedures and steps, as well as a limited set of samples for few-shot learning. Our investigation highlights the potential of this approach and underscores the significance of tailored in-context learning adaptations. These adjustments hold promise in mitigating the challenge of acquiring adequate training data, a common obstacle in deep learning-based NLP methods for procedure extraction.

                                              Keywords
                                              Procedural knowledge, knowledge graphs, large language models, knowledge capture




1. Introduction
Extracting complex knowledge from unstructured sources is a challenge; this work focuses in particular on industrial troubleshooting documents. These documents often contain detailed procedures represented as sequences of steps, which vary in textual form, making it challenging for automated algorithms to identify and organize the relevant information accurately. Despite advancements in Natural Language Processing (NLP), the scarcity of training data remains a significant obstacle for machine learning approaches. Consequently, novel methods are emerging, such as interactive dialogues and language models, to address this challenge [1].
   The paper emphasizes the importance of extracting relevant procedures, using the example of a shop floor operator needing to follow maintenance procedures for gear head lathe machinery. It outlines a typical sequence of activities involved in maintenance and highlights the importance of correct execution for optimal machine performance (see Figure 1). The shop floor worker may need to answer questions for which a simple keyword-based search in the document is not sufficient:

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
* Corresponding author.
anisa.rula@unibs.it (A. Rula); jennifer.dsouza@tib.eu (J. D’Souza)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




Figure 1: Example of a procedure.


- What are the steps involved in performing routine maintenance on gear head lathe machinery?
- Are there any sub-procedures or specialised steps for troubleshooting specific issues during
maintenance procedures for gear head lathe machinery?

   To address the limitations of keyword-based search methods, it is essential to extract and represent procedural knowledge using a vocabulary that encompasses domain-specific terms and concepts. This extraction and representation process is supported by an in-context learning strategy, allowing for the customization of large language models (LLMs) with minimal training data. Integrating this approach makes knowledge engineering more accessible to individuals lacking expertise in formal representation languages. The resulting structured knowledge can then be queried for semantic procedural data, enabling responses to queries that could not be answered over unstructured text. By querying over semantically structured procedures, computers can intelligently assist users in efficiently managing, comprehending, and executing procedures.
   The paper is structured as follows: in section 2 we propose our approach and in section 3 we discuss our experimental results. Finally, a brief discussion of related work is offered in section 4 and concluding remarks in section 5.


2. Approach
This section provides an overview of the methodology adopted to construct different versions
of question-answering systems through in-context learning. The primary focus during the
design of conversational interactions was on extracting procedures, including their constituent
steps and substeps as these form the fundamental components of any procedure. To create the
conversational system, we employed GPT-4 (Generative Pre-trained Transformer 4.0) hosted on
ChatGPT-Plus, along with the https://askyourpdf.com/ plugin for GPT-4. This choice was based
on GPT-4's status as a cutting-edge standard among LLMs. However, given that raw pre-trained language
models may not seamlessly align with specific tasks, we employed in-context learning through
prompting to customise the native model to varying extents. The following sections will provide
detailed insights into the strategies adopted for formulating query templates and the customised
models functioning as conversational systems.

2.1. Prompting and questioning
Figure 2 illustrates five distinct queries formulated for information extraction purposes. The questions, and therefore the construction of the procedure representation, are posed incrementally. First, we ask questions about the list of steps and substeps; then we ask questions regarding aggregations and comparisons; and finally, the precedence relations among steps. We briefly discuss each of these templates:




Figure 2: Question templates and their ordering.



    • Template 1 (List): Requests a detailed list of steps for a specific procedure, including any
      substeps of each main step.
    • Template 2 (Counting): Asks for the total number of steps within a procedure.
    • Template 3 (Comparison): Aims to identify the procedure with the maximum number of
      steps among a set of procedures.
    • Template 4 (Nested Procedures): Asks about the presence of sub-procedures or substeps
      within the main procedure.
    • Template 5 (Sequence): Asks for the next step after a specific step in a procedure, i.e., the
      step that comes immediately after the specified step.

These questions were structured incrementally to address the model’s limitations in handling
complex queries that encompass the entire procedural structure. This incremental approach
shares similarities with the iterative process of crafting conceptual models often involving
interactions with domain experts. It paves the way for versatile pipelines by combining diverse
incremental inquiries.
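   For concreteness, the sketch below shows how such templates could be instantiated as prompt strings sent to the model together with the relevant document excerpt. This is a minimal sketch: the wording of the "nested" template follows the prompt quoted later in section 3.2, while the other templates and the placeholder names (procedure, step, context) are illustrative assumptions rather than the exact prompts used in our experiments.

    # Sketch of the five incremental question templates (Section 2.1). The wording
    # of the "nested" template follows the prompt quoted in Section 3.2; the other
    # templates are illustrative paraphrases, not the exact prompts used.
    TEMPLATES = {
        "list": 'Question: Can you provide a detailed list of the steps of the '
                '"{procedure}" procedure in the given Context, including any '
                'substeps of each main step? Answer:',
        "counting": 'Question: How many steps does the "{procedure}" procedure '
                    'in the given Context consist of? Answer:',
        "comparison": 'Question: Which procedure described in the given Contexts '
                      'has the largest number of steps? Answer:',
        "nested": 'Question: Can you provide a detailed list of the substeps of '
                  'Step {step} in the given Context which refers to the '
                  '"{procedure}" procedure? If there are no substeps, please '
                  'reply with "no substeps". Answer:',
        "sequence": 'Question: In the "{procedure}" procedure in the given '
                    'Context, which step comes immediately after Step {step}? '
                    'Answer:',
    }

    def build_prompt(template_name: str, context: str, **slots: str) -> str:
        """Prepend the document excerpt to the chosen question template."""
        return f"Context: {context}\n\n" + TEMPLATES[template_name].format(**slots)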
2.2. In-context learning customisations
"Contextual learning" involves training models to understand the contextual environment
within which information is presented. This context can encompass surrounding text, images,
or other data facets. Contextual learning is particularly important for NLP tasks, where the meaning of a word or phrase can vary based on its surrounding context.
   Learning approaches: Raw vs Zero-shot vs 2-shot. The first learning approach aligns with "zero-shot learning", where the pre-trained GPT-4 model generalises its knowledge to tasks it has not been explicitly trained on, gaining its understanding only from the initial context provided to the model. In contrast, "2-shot learning" falls between traditional supervised learning and zero-shot learning. This approach involves providing the model with just two labelled examples per task, enabling it to generalise and make predictions for new instances even with limited labelled data.
   Contextual knowledge definitions. Contextual knowledge is provided through the identification of the specific domain and intensional definitions of the procedure elements to be extracted. Relying on intensional definitions offers the advantage of compactness without requiring the provision of examples. These definitions consist of the main concepts and properties defined in the ontology, along with definitions of Procedure, Steps and Substeps. Each definition is labelled with the question it was used for. These choices were made to minimise external knowledge while providing an initial empirical assessment of using intensional definitions in the customisation of pre-trained models.
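   As a rough illustration of the difference between the two customisations, the sketch below assembles a zero-shot prompt (definitions plus question) and a 2-shot prompt (definitions, two worked examples, then the question). The definition text and the example structure are assumptions made for illustration; they stand in for the ontology-based intensional definitions and the labelled examples actually used.

    # Illustrative assembly of zero-shot vs. 2-shot prompts (Section 2.2).
    # DEFINITIONS stands in for the ontology-based intensional definitions; its
    # wording here is an assumption, not the text used in the experiments.
    DEFINITIONS = ("A Procedure is a plan composed of an ordered sequence of Steps. "
                   "A Step may be decomposed into Substeps that form a sub-procedure.")

    def zero_shot_prompt(context: str, question: str) -> str:
        # Only the definitions and the target question are provided to the model.
        return f"{DEFINITIONS}\n\nContext: {context}\n\n{question}"

    def two_shot_prompt(examples, context: str, question: str) -> str:
        # examples: two (context, question, gold answer) triples shown before the
        # actual query, so the model can imitate the expected answer format.
        shots = "\n\n".join(f"Context: {c}\n\n{q}\n{a}" for c, q, a in examples[:2])
        return f"{DEFINITIONS}\n\n{shots}\n\nContext: {context}\n\n{question}"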


3. Evaluations
In this section, we discuss the results obtained by leveraging ChatGPT4 for procedural text mining with respect to the five prompting scenarios introduced in subsection 2.1.

3.1. Datasets
To understand the effectiveness of our approach, we select four domains where public data are easily accessible and which cover all of the challenges described above.

    • Photography, manuals that provide instructions for operating, maintaining, and trou-
      bleshooting cameras, covering settings, capturing images or videos, lens care, and handling
      various scenarios.
    • Manufacturing, manuals that provide instructions for production, quality control, assem-
      bly, maintenance, and safety in various manufacturing operations.
    • Medicine, manuals that provide instructions for dental instrument usage, sterilisation,
      X-rays, oral hygiene instructions, and emergency protocols.
    • Agriculture, manuals that provide instructions for the operation, maintenance, and safety
      protocols of farm equipment.

  We now give details on how we extracted the documents from each data source. First, we examine all the procedures defined in the manuals. Second, we specifically choose procedures that adhere to a structured format suitable for enumeration, such as numbering, bullet points, or clear indentation. Third, we prioritise procedures that fit on a single page or, at most, span two consecutive pages. For each domain, we extract three examples of procedures, either from the same manual or from different manuals.
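   The second selection criterion can, in principle, be checked automatically. The sketch below is a rough heuristic for it, assuming that enumerated procedures can be recognised by markers at the start of their lines; in our study the selection was performed by inspecting the manuals, so the patterns shown are illustrative assumptions only.

    import re

    # Heuristic check of the second selection criterion: a passage is kept only if
    # several of its lines start with an enumeration marker (number, bullet, or
    # lettered item). Illustrative only; the selection in this study was manual.
    ENUMERATION_MARKER = re.compile(r"^\s*(\d+[.)]|[-•*]|\([a-z]\))\s+")

    def looks_like_enumerated_procedure(passage: str, min_marked_lines: int = 3) -> bool:
        marked = sum(bool(ENUMERATION_MARKER.match(line))
                     for line in passage.splitlines())
        return marked >= min_marked_lines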

3.2. Qualitative Evaluations
The qualitative evaluations are discussed in terms of eight observations, presented as questions.
   1. With the 2-shot in-context learning setting, can one tailor the model to respond in a certain
way or a certain format? E.g., for prompt scenario 5 in medicine, when asked for the next
instruction in sequence, we would like the agent to reply with just the name or the sentence
corresponding to the next instruction. However, the model in the "raw" setting responds not only with the next instruction, but also with its substeps or additional information such as notes or warnings related to the instruction. See the response, where it also adds the sentence “Refer to page 17 of the document.”, which is additional information in the context of the instruction sentence “CONTROL BOX INSTALLATION”. In the 2-shot setting, in contrast, the model, after seeing the reference examples, responds for the same query with just the essential information, as shown in the in-context examples: the response now reads “... the next step after Step 4 "Head Installation" is Step 5 "Control Box Installation".”.
   2. How does ChatGPT4 handle the extraction of a procedure across pages? In the manufacturing
domain, the instructions for "support plate installation" span two pages. This is example 2, handling the “Support plate installation” procedure in the set of procedures. The ChatGPT4 response for the type 1 prompt to list all steps, substeps, etc. accurately reflected the expected gold-standard. Thus we see that the agent can assimilate information across pages while maintaining the right context. Another example from the manufacturing domain is example 4, in the context of the type 1 prompt, for the "Removal and Installation of Mechanical Seal" sub-procedure of the "Shaft-seal maintenance" procedure. Here again, the ChatGPT4 response, aside from splitting some of the instructions, is 90% in accordance with the gold-standard. It has successfully extracted the first step from the first page and the remaining steps from the next page of the manual. For the same instruction, when prompted in the 2-shot setting with examples of the desired response provided in the prompt, the ChatGPT4 response exactly matches the gold-standard. This behaviour is consistently observed for the other domains as well.
   3. Apart from text generation discrepancies, has ChatGPT4 overcome the limitation of LLMs of not being truly capable of mathematical logic or reasoning, remaining instead very powerful statistical text generators [2, 3]? In the manufacturing domain, for type 3 comparison prompts, the language model was asked to compare the number of instructions given two contexts with procedures and reply which context had more instructions. Intriguingly, one of the contexts contained two procedures. Thus the task of the language model was to consider each independent procedure within each context and return which one had the most instructions. The incorrect model response on a relatively simple reasoning task lends further credence to the conjecture: are large language models simply very good statistical generators, otherwise incapable of truly reasoning?
   4. Are our instructions completely unambiguous to the model? We observed that in some cases they might be ambiguous. For instance, in prompt 4, i.e. the setting probing the model to produce nested instructions, our prompt reads as follows:
        Question: Can you provide a detailed list of the substeps of Step X in the given Context which refers to the "[name]" procedure? If there are no substeps, please reply with "no substeps". Answer:


   Sometimes, if the main instruction itself is a rather long sentence, e.g., the generated step 7 here, the prompt above proves ambiguous and ChatGPT4 splits the long instruction into a sequence of steps, as in this response. In-context learning alleviates this ambiguity: in our 2-shot setting, the same prompt yields the correct response.
   5. Can ChatGPT4 effectively extract information from manuals in a 2-column format, processing
each column accurately?
   We find that it can extract content to create a response cleanly column-by-column. However, if queried about a procedure described in the first column, it may not detect that the procedure ends within that column. It may continue generating text, even including a new procedure that starts in the second column of the same page. E.g., the following manual on operating tractors has two distinct procedures, i.e. Operating the Hydrostatic Transmission (listed completely in column 1) and Using Cruise Control - 1026R (listed completely in column 2). In the prompt 1 scenario, when ChatGPT4 is asked to list the steps for “Operating the Hydrostatic Transmission” (see example 1), it successfully extracts the relevant text but continues extracting text for the “Cruise Control” procedure as well, listing its steps as substeps of the last step of “Operating the Hydrostatic Transmission.” See the response here.
   6. Is ChatGPT4 able to comprehend the correct application of the ontology even though the
generated response does not match the gold-standard?
   A step in the ontology is described as follows: it is first given an instance name and initialised as a p-plan:Step, e.g., “kh-p-instance:Step2 a p-plan:Step ;”. Next, the step is assigned a label, e.g., “rdfs:label "Attach the hoses to the flowmeter" ;”. Then the step following the given step is specified; this can be a substep if the given step has substeps, e.g., “kh-p:nextStep kh-p-instance:SubStep2_1 ;”. Then the step in question is linked to the main plan it belongs to, e.g., “kh-p:isStepOfPlan kh-p-instance:Plan1 ;”. For those steps with substeps, the name of a subplan is specified, e.g., “kh-p:isDecomposedAsPlan kh-p-instance:SubPlan2 .” This subplan is the plan to which the substeps of the main step are attached. For a complete example, see lines 48 to 52 in the gold-standard example 1 in the medical domain. Note that if a step does not have substeps, the next step specified is the next main step rather than a substep, and the line specifying the decomposed plan is not present. Now, looking at the ChatGPT4 response for the same procedure and the same instruction step in lines 17 to 27, we see that it has specified the step incorrectly, leading in turn to an incorrect application of the ontology. First, as the next step, instead of specifying the substep, it used the subplan. Then, for the substeps, instead of specifying the decomposed plan for the main step at the step specification, it specifies it at the plan level, which was initialised as the next step of the main step, leading to something meaningless and not machine-actionable.
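   For readability, the pattern just described can be condensed into a single fragment. The snippet below reconstructs it from the Turtle triples quoted above and wraps it as a string, the kind of worked example that can be supplied in a 2-shot prompt; it is a reconstruction, not an excerpt from the gold-standard files.

    # Reconstruction of the step pattern described above, assembled from the Turtle
    # snippets quoted in the text; the kind of worked example that can be supplied
    # in a 2-shot prompt. It is not an excerpt from the gold-standard files.
    STEP_WITH_SUBSTEPS_EXAMPLE = """
    kh-p-instance:Step2 a p-plan:Step ;
        rdfs:label "Attach the hoses to the flowmeter" ;
        kh-p:nextStep kh-p-instance:SubStep2_1 ;          # first substep, since Step2 has substeps
        kh-p:isStepOfPlan kh-p-instance:Plan1 ;           # the main plan Step2 belongs to
        kh-p:isDecomposedAsPlan kh-p-instance:SubPlan2 .  # subplan grouping Step2's substeps
    """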
   An incorrect application of the ontology is also found in the ChatGPT4 response to procedure 3, “Installation of FM Type.” Specifically, consider the “Pole Assembly Installation” main step, lines 12 to 23. The main step is initialised as a type of Plan, and there is no specification of the decomposed plan with its substeps. Thus, in the zero-shot setting, ChatGPT4 cannot be expected to correctly apply the ontology. Promisingly, this changes in the 2-shot setting, where ChatGPT4 is shown two examples of the correct application of the ontology. Then, via in-context learning, it is able to apply the ontology correctly. See the ChatGPT4 response in the 2-shot setting for the same example, a perfect application of the ontology, showing that ChatGPT4 can be successfully guided via in-context learning toward the correct application of an ontology. Thus, at least for the ontology setting, it appears necessary to query ChatGPT4 via the in-context learning methodology, showing it some examples of the correct application of the ontology.
   As an observational note, in simpler ontology application scenarios, i.e. when there are just
steps with no substeps, it does very well. E.g. from agriculture, for example 4, the ChatGPT4
ontologised response for the prompt 1 scenario to list steps is almost identical to the gold-
standard.
   7. Has ChatGPT4 hallucinated? One still needs to be wary when using ChatGPT4, as it can still entirely hallucinate content. Consider the prompt 4 scenario, listing nested procedures, in the 2-shot setting: despite precise instructions as well as an in-context example, when asked to list the substeps of step 3 of the “installation of the FM type” procedure, ChatGPT4 still hallucinated the entire response. Compare this with the expected answer for substep3_1, 3_2, and 3_3.
   8. Is the 2-shot setting infallible, or does the model occasionally produce unexplained hallucinations? In the 2-shot setting, we have also observed scenarios where the model hallucinated text. While it may have grasped the ontology components and their application relatively well, for reasons we could not explain, the text generated as steps was entirely made up and could not be found in the manual. E.g., compare the generated ChatGPT4 response in the 2-shot setting with the ontologised gold-standard or the text-based gold-standard.


4. Related Work
To situate our contribution within prior research, we examine the methods used for procedural text mining and the incorporation of Large Language Models (LLMs) for knowledge extraction.
   Knowledge Extraction from Unstructured Sources. Extracting complex knowledge from unstructured sources presents several challenges across domains. The variability of these sources complicates the accurate extraction and structuring of relevant information through knowledge extraction algorithms, which are usually applied to specific domains [4]. The intricate nature
of these documents requires manual review by domain experts after automated extraction,
underscoring the limitations of machine-learning-based approaches [5]. Innovative methods,
including interactive dialogues and language models, have emerged to address the lack of readily
available training data for machine learning methods [6, 7].
   Procedural Text Mining and Large Language Models (LLMs). In response to the chal-
lenges mentioned earlier, our research delves into the field of procedural text mining, capitalizing
on advancements in Natural Language Processing (NLP). Large Language Models (LLMs) have
emerged as a pivotal tool in this endeavour, surpassing the capabilities of traditional symbolic
AI and machine learning technologies [8, 9]. These models offer a means to address the intri-
cate and extensive nature of procedural documents, with the potential to enhance knowledge
extraction efficiency.
   Integration of LLMs in Knowledge Extraction. Large Language Models (LLMs) demon-
strate exceptional capabilities in natural language processing, surpassing what conventional
symbolic AI and machine learning technologies can achieve [10]. These capabilities have
sparked a substantial increase in proofs of concept and practical applications of LLMs, suggest-
ing their potential utility in various knowledge-related tasks [11]. Nevertheless, the exploration
of methods for effectively integrating LLMs into structured, controllable, and repeatable ap-
proaches for the development and deployment of such applications in production is still in its
early stages and requires further detailed consideration [12]. Similarly, our study centers on
the integration of LLMs, notably the state-of-the-art GPT-4 model, in the context of extracting
procedural knowledge from unstructured PDF documents.
   Our solution proposes an innovative approach of combining large language models, specifi-
cally GPT-4, with in-context learning strategies to address the challenges of extracting procedu-
ral knowledge from unstructured PDF documents, offering a novel solution to the scarcity of
training data in deep learning-based NLP techniques for procedure extraction.


5. Conclusion
In this study, we explored the feasibility of employing in-context learning with pre-trained
language models to extract procedure elements from textual documents. We examined the
native GPT-4 model and two customised variants, adapted in context with procedure element definitions
and a limited set of examples. While GPT-4 is touted as the current most potent Large Language Model
(LLM) with an expansive parameter count exceeding a trillion, its application is limited due to
its proprietary nature. Notably, it also lacks the capacity for fine-tuning as of the time of writing
this paper. In forthcoming research, we propose evaluating open-source LLMs, such as those
related to T5 [13, 14, 15, 16] or Llama [17, 18]. This approach offers two key advantages: firstly,
the adoption of open-source models would promote research democratization by removing
paywalls as a barrier, and secondly, it facilitates potential enhancements, as these open models
provide comprehensive technical insights into their pretraining datasets and strategies.


References
 [1] P. Bellan, M. Dragoni, C. Ghidini, Process extraction from text: state of the art and
     challenges for the future, CoRR abs/2110.03754 (2021).
 [2] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma,
     D. Zhou, D. Metzler, et al., Emergent abilities of large language models, arXiv preprint
     arXiv:2206.07682 (2022).
 [3] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. San-
     toro, A. Gupta, A. Garriga-Alonso, et al., Beyond the imitation game: Quantifying and
     extrapolating the capabilities of language models, arXiv preprint arXiv:2206.04615 (2022).
 [4] M. Y. Jaradeh, K. Singh, M. Stocker, A. Both, S. Auer, Better call the plumber: Orchestrating
     dynamic information extraction pipelines, in: ICWE, volume 12706 of Lecture Notes in
     Computer Science, Springer, 2021, pp. 240–254.
 [5] Z. Dong, S. Paul, K. Tassenberg, G. Melton, H. Dong, Transformation from human-readable
     documents and archives in arc welding domain to machine-interpretable data, Comput.
     Ind. 128 (2021) 103439.
 [6] A. Rebmann, H. van der Aa, Extracting semantic process information from the natural
     language in event logs, in: CAiSE, volume 12751, Springer, 2021, pp. 57–74.
 [7] P. Bertoli, F. Corcoglioniti, C. D. Francescomarino, M. Dragoni, C. Ghidini, M. Pistore,
     Semantic modeling and analysis of complex data-aware processes and their executions,
     Expert Syst. Appl. 198 (2022) 116702.
 [8] K. Sun, Y. E. Xu, H. Zha, Y. Liu, X. L. Dong, Head-to-tail: How knowledgeable
     are large language models (llm)? a.k.a. will llms replace knowledge graphs?, 2023.
     arXiv:2308.10168.
 [9] P. Bellan, M. Dragoni, C. Ghidini, Extracting business process entities and relations from
     text using pre-trained language models and in-context learning, in: Enterprise Design,
     Operations, and Computing - 26th International Conference, EDOC 2022, Bozen-Bolzano,
     Italy, October 3-7, 2022, Proceedings, volume 13585 of Lecture Notes in Computer Science,
     Springer, 2022, pp. 182–199. doi:10.1007/978-3-031-17604-3_11.
[10] K. Mahowald, A. A. Ivanova, I. A. Blank, N. Kanwisher, J. B. Tenenbaum, E. Fedorenko,
     Dissociating language and thought in large language models: a cognitive perspective, 2023.
     arXiv:2301.06627.
[11] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz,
     D. Roth, Recent advances in natural language processing via large pre-trained language
     models: A survey, ACM Comput. Surv. (2023). URL: https://doi.org/10.1145/3605943.
     doi:10.1145/3605943.
[12] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, X. Wu, Unifying large language models and
     knowledge graphs: A roadmap, 2023. arXiv:2306.08302.
[13] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu,
     Exploring the limits of transfer learning with a unified text-to-text transformer, The
     Journal of Machine Learning Research 21 (2020) 5485–5551.
[14] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le,
     Finetuned language models are zero-shot learners, arXiv preprint arXiv:2109.01652 (2021).
[15] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. De-
     hghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv preprint
     arXiv:2210.11416 (2022).
[16] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph,
     J. Wei, et al., The flan collection: Designing data and methods for effective instruction
     tuning, arXiv preprint arXiv:2301.13688 (2023).
[17] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière,
     N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language
     models, arXiv preprint arXiv:2302.13971 (2023).
[18] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra,
     P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models,
     arXiv preprint arXiv:2307.09288 (2023).