=Paper=
{{Paper
|id=Vol-3758/paper-24
|storemode=property
|title=CSV-PM-LLM-Parsing: Automatic Ingestion of CSV Event Logs for Process Mining using LLMs
|pdfUrl=https://ceur-ws.org/Vol-3758/paper-24.pdf
|volume=Vol-3758
|authors=Alessandro Berti,Wil M.P. van der Aalst
|dblpUrl=https://dblp.org/rec/conf/bpm/BertiA24
}}
==CSV-PM-LLM-Parsing: Automatic Ingestion of CSV Event Logs for Process Mining using LLMs==
<pdf width="1500px">https://ceur-ws.org/Vol-3758/paper-24.pdf</pdf>
<pre>
                         CSV-PM-LLM-Parsing: Automatic Ingestion of CSV Event
                         Logs for Process Mining using LLMs
                         Alessandro Berti1,2,* , Wil M.P. van der Aalst1,2
                         1
                             Process and Data Science Chair, RWTH Aachen University, Aachen, Germany
                         2
                             Fraunhofer FIT, Sankt Augustin, Germany


                                       Abstract
                                       This demo paper introduces CSV-PM-LLM-Parsing, a Python library to automatically transform unstructured
                                       event logs contained in CSV files into process mining event logs by using Large Language Models (LLMs). In
                                       particular, we implement modules for three problems: 1) identification of the separator and quote character, 2)
                                       selection of the essential process mining columns, and 3) timestamp format detection. Our library is compatible
                                       with any LLM supporting the OpenAI’s API specification. Our experiment shows that some LLMs (such as
                                       gpt-4o or the open-source Qwen/Qwen2-72B-Instruct) can effectively tackle all three problems. The source code is
                                       publicly available in the Git repository https://github.com/fit-alessandro-berti/csv-pm-llm-parsing.

                                       Keywords
                                       Object-Centric Anomaly Detection, Object-Centric Feature Extraction, Procurement Processes, Large Language
                                       Models


                            Process mining is a branch of data science aiming to infer process-related insights from the event data
                         recorded by the information systems supporting the execution of such processes. The most important
                         artifact in process mining is the event log, which contains a collection of events extracted from the
                         information systems and allows the application of techniques such as process discovery, conformance
                         checking, model enhancement, and predictive analytics. Standard formats have been proposed for
                         process mining, including XES [1] for traditional event logs (in which an event is related to a single
                         case) and OCEL 2.0 [2] for object-centric event logs (in which an event is related to different objects
                         of different object types). However, in practice, CSV files are often the main data source for process
                         mining, leading to challenges in the detection of the encoding, fundamental parameters (separator and
                         quote character), challenges in the identification of the main columns (case identifier, activity, and
                         timestamp), and challenges in the detection of the timestamp format.
                            In this demo paper, we propose a Python library to overcome the challenges of importing a CSV
                         for process mining purposes using Large Language Models (LLMs). LLMs, being trained on a vast
                         corpus of data, can resolve the aforementioned challenges. LLMs have been successfully applied to
                         other process mining contexts requiring domain knowledge [3, 4], and can be seen as a zero-shot [5] or
                         few-shot [6] executors for mainstream tasks. The energy consumption of LLMs is concerning and poses
                         sustainability challenges [7], which may discourage their application for tasks that can be resolved with
                         other techniques. However, their application is justified for problems requiring domain knowledge,
                         which also include CSV parsing for process mining purposes.


                         1. Modules of the Library
                         In this section, we introduce the three modules of the library. First, we explain the LLM-based approach
                         to identify the separator and quote characters of the CSV file. Then, we describe the LLM-based
                         identification of the case identifier, activity, and timestamp columns. Finally, we discuss the automatic
                         detection of the timestamp column format.

                         Proceedings of the Best BPM Dissertation Award, Doctoral Consortium, and Demonstrations & Resources Forum co-located with
                         22nd International Conference on Business Process Management (BPM 2024), Krakow, Poland, September 1st to 6th, 2024.
                         *
                           Corresponding author.
                         $ a.berti@pads.rwth-aachen.de (A. Berti); wvdaalst@pads.rwth-aachen.de (W. M.P. v. d. Aalst)
                          0000-0002-3279-4795 (A. Berti); 0000-0002-0955-6940 (W. M.P. v. d. Aalst)
                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
             Listing 1: Prompt for the identification of the case identifier, activity, and timestamp column in a CSV
             file.
Given the dataframe with the following columns:

Data columns (total 5 columns):
 # Column Non−Null Count Dtype
−−− −−−−−− −−−−−−−−−−−−−− −−−−−
 0 case_id 6 non−null int64
 1 task 6 non−null object
 2 completion_time 6 non−null object
 3 resource 6 non−null object
 4 cost 6 non−null int64
dtypes: int64(2), object(3)


Can you suggest some columns for the case identifier, activity, and completion timestamp?
Please produce a JSON containing as keys: ’caseid’, ’activity’, ’timestamp’
Each key should be associated with the name of the column.


1.1. Identification of the Separator and the Quote Character
Identification of the separator and the quote character is critical for accurate CSV parsing. The separator
distinguishes different fields within a row, while the quote character allows the inclusion of separators
within field values. Misidentifying these characters can lead to incorrect data parsing, causing errors in
subsequent data processing tasks.
   For humans, identifying the separator and quote character can be challenging due to the variability
in CSV formats. Common separators include commas, semicolons, tabs, and spaces, while common
quote characters include double quotes and single quotes. Users may need to examine the CSV file
closely to determine which characters are used, a task that can be time-consuming and prone to error,
especially with complex datasets.
   Implementing an automatic detector, or sniffer, for these characters involves several challenges. The
sniffer must accurately analyze the CSV content to infer the correct the separator and quote characters
without human input. This requires robust algorithms capable of handling diverse formats and edge
cases, such as fields containing separators or quote characters within quoted strings.
   In our library, we provide the initial characters of the CSV file to the LLM to identify the separator
and quote character. These are used to parse the CSV file into a Pandas dataframe. If the parsing fails,
the prompt is executed again with the recorded error until valid separator and quote characters are
provided.
   However, in some situations, no suitable separator and quote character could be discovered because
the CSV is malformed. In that situation, the parsing can be attempted by ignoring the malformed rows.

1.2. Identification of the Main Attributes (Case Identifier, Activity, and Timestamp)
Detecting which columns correspond to the case identifier, activity, and timestamp when ingesting a
CSV for process mining is crucial because these elements are the foundation for creating an event log.
The case identifier groups events into specific instances of a process, the activity column specifies the
actions taken, and the timestamp provides the chronological order of events.
   End users may face several challenges in accurately identifying the case identifier, activity, and
timestamp columns in a CSV file. One common difficulty is the variability in naming conventions,
where column names may not be intuitive or standardized, making it hard to discern their purposes.
Additionally, users might not have domain expertise, leading to confusion about which columns should
serve as case identifiers or timestamps, especially if the data includes multiple date or ID fields.
   Current automated techniques to detect the case identifier, activity, and timestamp columns in a CSV
file for process mining primarily rely on two approaches. The first approach uses regular expression
matching to identify these columns based on their names, attempting to match patterns that suggest
they represent case IDs, activities, or timestamps. For the second category, different techniques have
been proposed in the literature. For example, [8] and [9] focus on the identification of the case identifier
column based on statistical techniques. In [10] and [11], deep-learning-based techniques have been
used to detect the case identifier, activity, and timestamp columns. However, the computational cost of
these models is high. In [12], a meta-model is built starting from a relational database that allows the
discovery/recommendation of a case notion. However, the technique is not applicable to standard CSVs.
   In our library, we provide the name and data type of the columns to the LLM, requesting identification
of the case identifier, activity, and timestamp columns (a corresponding prompt is shown in Fig. 1). If
the recognition fails, the prompt is executed again until a valid combination of columns is identified.

1.3. Timestamp Format Detection
In process mining, converting timestamp strings to timestamp objects is essential because it ensures
accurate temporal analysis and sequencing of events within the process. This conversion allows for the
calculation of durations, intervals, and the order of activities, which are critical for identifying process
inefficiencies and optimizing workflows.
   However, end users often struggle to specify the correct format of timestamp columns due to variations
in date and time representations, leading to errors in data interpretation.
   Implementing automated approaches to address this challenge is also difficult, as it requires robust
algorithms capable of correctly identifying and parsing a wide range of timestamp formats, accounting
for regional differences, inconsistencies, and potential ambiguities in the data. For instance, libraries
like dateutil in Python and Java’s java.time package provide built-in functions to parse multiple date
and time formats. However, they fail on non-common formats.
   In our library, we provide the top values of the timestamp column to the LLM, which proposes a
format. The timestamp column is then parsed according to this proposed format. If parsing fails, the
resulting error is given to the LLM, which then proposes an alternative format.
   The proposed approach fails when the timestamp column contains values of mixed types (for example,
some values follow ISO8601, and some follow RFC1123). In that case, a line-by-line parsing needs to be
adopted, which is impractical on LLMs.


2. Installation
The library can be installed in Python ≥ 3.9.x executing the command pip install -U csv-pm-
llm-parsing. The source code is publicly available in the Git repository https://github.com/
fit-alessandro-berti/csv-pm-llm-parsing. The library can use as backend LLM any advanced LLM
exposing the OpenAI’s API specification.
   To setup the connection to the LLM, the openai_api_url, openai_api_key, and openai_model
should be provided to the methods of the library. Alternatively, the parameters could be set up in the
system environment variables OPENAI_API_URL, OPENAI_API_KEY, and OPENAI_MODEL.
   Some example settings are provided:

       • OpenAI’s GPT-4O:
          openai_api_url = ’https://api.openai.com/v1’; openai_api_key = ’replace’; openai_model
          = ’gpt-4o’;

       • Locally run Ollama1 :
         openai_api_url = ’http://127.0.0.1:11434/v1’; openai_api_key = ’replace’; openai_model
         = ’qwen2:7b-instruct-q6_K’;

       • DeepInfra’s open-source model2 :
         openai_api_url = ’https://api.deepinfra.com/v1/openai/’; openai_api_key = ’replace’;
         openai_model = ’Qwen/Qwen2-72B-Instruct’;


1
    https://ollama.com/library/qwen2:7b-instruct-q6_K
2
    https://deepinfra.com/Qwen/Qwen2-72B-Instruct
3. Initial Experiments
We generated some synthetic datasets that can be used to test the library. In particular:
    • https://github.com/fit-alessandro-berti/csv-pm-llm-parsing/tree/main/testfiles/sep_detection
      contains some examples of common choices for separator and quote character (i.e., command
      and double quote, semicolon and single quote, tab and single quote, . . . ).

    • https://github.com/fit-alessandro-berti/csv-pm-llm-parsing/tree/main/testfiles/cid_acti_timest
      contains some test cases for the selection of the case identifier, activity, and timestamp columns.

    • https://github.com/fit-alessandro-berti/csv-pm-llm-parsing/tree/main/testfiles/timest_format
      contains some test cases for the identification of the timestamp format (containing formats such
      as ISO8601, UNIX, RFC1123, . . . ).

    • https://github.com/fit-alessandro-berti/csv-pm-llm-parsing/tree/main/testfiles/overall contains
      more complex test cases to test the overall functioning of the library.
We propose in the folder
https://github.com/fit-alessandro-berti/csv-pm-llm-parsing/tree/main/examples some examples that
use the library for every one of the provided test cases.
   In general, we found that larger LLMs (such as gpt-4o and Qwen/Qwen2-72B-Instruct) can execute the
tasks satisfactorily, while smaller LLMs (such as qwen2:7b-instruct-q6_K), while still competent, fail on
different test cases.
   A video showcasing the library and its functioning on some test cases is available at https://youtu.be/
L21sapgCzbM.


References
 [1] M. T. Wynn, W. M. P. van der Aalst, E. Verbeek, B. N. D. Stefano, The IEEE XES standard for
     process mining: Experiences, adoption, and revision [society briefs], IEEE Comput. Intell. Mag. 19
     (2024) 20–23.
 [2] I. Koren, N. Adams, A. Berti, OCEL 2.0 resources - www.ocel-standard.org, CoRR abs/2403.01982
     (2024). arXiv:2403.01982.
 [3] A. Berti, D. Schuster, W. M. P. van der Aalst, Abstractions, scenarios, and prompt definitions for
     process mining with llms: A case study, in: BPM 2023 Workshops, volume 492, Springer, 2023, pp.
     427–439.
 [4] A. Berti, H. Kourani, H. Hafke, C. Yun-Li, D. Schuster, Evaluating Large Language Models in
     Process Mining: Capabilities, Benchmarks, Evaluation Strategies, and Future Challenges, in:
     Proceedings of the BPM-DS 2024 Working Conference, Springer, 2024.
 [5] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners,
     in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), NeurIPS 2022, 2022.
 [6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. K. et al., Language models are few-shot learners, in:
     H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information
     Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020,
     NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL: https://proceedings.neurips.cc/paper/2020/
     hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
 [7] A. S. Luccioni, S. Viguier, A. Ligozat, Estimating the carbon footprint of bloom, a 176b parameter
     language model, J. Mach. Learn. Res. 24 (2023) 253:1–253:15. URL: http://jmlr.org/papers/v24/
     23-0069.html.
 [8] A. A. Andaloussi, A. Burattin, B. Weber, Toward an automated labeling of event log attributes, in:
     J. Gulden, I. Reinhartz-Berger, R. Schmidt, S. Guerreiro, W. Guédria, P. Bera (Eds.), BPMDS 2018
     and EMMSAD 2018, Held at CAiSE 2018, Proceedings, volume 318 of Lecture Notes in Business
     Information Processing, Springer, 2018, pp. 82–96.
 [9] A. Burattin, R. Vigo, A framework for semi-automated process instance discovery from decorative
     attributes, in: CIDM 2011 Proceedings, IEEE, 2011, pp. 176–183.
[10] S. Sim, R. A. Sutrisnowati, S. Won, S. Lee, H. Bae, Automatic conversion of event data to event
     logs using CNN and event density embedding, IEEE Access 10 (2022) 15994–16009.
[11] K. Toyoda, R. G. K. Ying, A. N. Zhang, P. S. Tan, Identifying the key attributes in an unlabeled
     event log for automated process discovery, IEEE Trans. Serv. Comput. 17 (2024) 74–81.
[12] E. G. L. de Murillas, H. A. Reijers, W. M. P. van der Aalst, Case notion discovery and recommen-
     dation: automated event log building on databases, Knowl. Inf. Syst. 62 (2020) 2539–2575. URL:
     https://doi.org/10.1007/s10115-019-01430-6.

</pre>