<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Corresponding author.
$ a.berti@pads.rwth-aachen.de (A. Berti); wvdaalst@pads.rwth-aachen.de (W. M.P. v. d. Aalst)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>CSV-PM-LLM-Parsing: Automatic Ingestion of CSV Event Logs for Process Mining using LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Berti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wil M.P. van der Aalst</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer FIT</institution>
          ,
          <addr-line>Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Process and Data Science Chair, RWTH Aachen University</institution>
          ,
          <addr-line>Aachen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This demo paper introduces CSV-PM-LLM-Parsing, a Python library to automatically transform unstructured event logs contained in CSV files into process mining event logs by using Large Language Models (LLMs). In particular, we implement modules for three problems: 1) identification of the separator and quote character, 2) selection of the essential process mining columns, and 3) timestamp format detection. Our library is compatible with any LLM supporting the OpenAI's API specification. Our experiment shows that some LLMs (such as gpt-4o or the open-source Qwen/Qwen2-72B-Instruct) can efectively tackle all three problems. The source code is publicly available in the Git repository https://github.com/fit-alessandro-berti/csv-pm-llm-parsing.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Object-Centric Anomaly Detection</kwd>
        <kwd>Object-Centric Feature Extraction</kwd>
        <kwd>Procurement Processes</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Modules of the Library</title>
      <p>In this section, we introduce the three modules of the library. First, we explain the LLM-based approach
to identify the separator and quote characters of the CSV file. Then, we describe the LLM-based
identification of the case identifier, activity, and timestamp columns. Finally, we discuss the automatic
detection of the timestamp column format.</p>
      <p>Listing 1: Prompt for the identification of the case identifier, activity, and timestamp column in a CSV
file.</p>
      <p>
        Given the dataframe with the following columns:
Data columns (total 5 columns):
# Column Non−Null Count Dtype
−−− −−−−−− −−−−−−−−−−−−−− −−−−−
0 case_id 6 non−null int64
1 task 6 non−null object
2 completion_time 6 non−null object
3 resource 6 non−null object
4 cost 6 non−null int64
dtypes: int64(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ), object(
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
Can you suggest some columns for the case identifier, activity, and completion timestamp?
Please produce a JSON containing as keys: ’caseid’, ’activity’, ’timestamp’
Each key should be associated with the name of the column.
      </p>
      <sec id="sec-1-1">
        <title>1.1. Identification of the Separator and the Quote Character</title>
        <p>Identification of the separator and the quote character is critical for accurate CSV parsing. The separator
distinguishes diferent fields within a row, while the quote character allows the inclusion of separators
within field values. Misidentifying these characters can lead to incorrect data parsing, causing errors in
subsequent data processing tasks.</p>
        <p>For humans, identifying the separator and quote character can be challenging due to the variability
in CSV formats. Common separators include commas, semicolons, tabs, and spaces, while common
quote characters include double quotes and single quotes. Users may need to examine the CSV file
closely to determine which characters are used, a task that can be time-consuming and prone to error,
especially with complex datasets.</p>
        <p>Implementing an automatic detector, or snifer , for these characters involves several challenges. The
snifer must accurately analyze the CSV content to infer the correct the separator and quote characters
without human input. This requires robust algorithms capable of handling diverse formats and edge
cases, such as fields containing separators or quote characters within quoted strings.</p>
        <p>In our library, we provide the initial characters of the CSV file to the LLM to identify the separator
and quote character. These are used to parse the CSV file into a Pandas dataframe. If the parsing fails,
the prompt is executed again with the recorded error until valid separator and quote characters are
provided.</p>
        <p>However, in some situations, no suitable separator and quote character could be discovered because
the CSV is malformed. In that situation, the parsing can be attempted by ignoring the malformed rows.</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Identification of the Main Attributes (Case Identifier, Activity, and Timestamp)</title>
        <p>Detecting which columns correspond to the case identifier, activity, and timestamp when ingesting a
CSV for process mining is crucial because these elements are the foundation for creating an event log.
The case identifier groups events into specific instances of a process, the activity column specifies the
actions taken, and the timestamp provides the chronological order of events.</p>
        <p>End users may face several challenges in accurately identifying the case identifier, activity, and
timestamp columns in a CSV file. One common dificulty is the variability in naming conventions,
where column names may not be intuitive or standardized, making it hard to discern their purposes.
Additionally, users might not have domain expertise, leading to confusion about which columns should
serve as case identifiers or timestamps, especially if the data includes multiple date or ID fields.</p>
        <p>Current automated techniques to detect the case identifier, activity, and timestamp columns in a CSV
ifle for process mining primarily rely on two approaches. The first approach uses regular expression
matching to identify these columns based on their names, attempting to match patterns that suggest
they represent case IDs, activities, or timestamps. For the second category, diferent techniques have
been proposed in the literature. For example, [8] and [9] focus on the identification of the case identifier
column based on statistical techniques. In [10] and [11], deep-learning-based techniques have been
used to detect the case identifier, activity, and timestamp columns. However, the computational cost of
these models is high. In [12], a meta-model is built starting from a relational database that allows the
discovery/recommendation of a case notion. However, the technique is not applicable to standard CSVs.</p>
        <p>In our library, we provide the name and data type of the columns to the LLM, requesting identification
of the case identifier, activity, and timestamp columns (a corresponding prompt is shown in Fig. 1). If
the recognition fails, the prompt is executed again until a valid combination of columns is identified.</p>
      </sec>
      <sec id="sec-1-3">
        <title>1.3. Timestamp Format Detection</title>
        <p>In process mining, converting timestamp strings to timestamp objects is essential because it ensures
accurate temporal analysis and sequencing of events within the process. This conversion allows for the
calculation of durations, intervals, and the order of activities, which are critical for identifying process
ineficiencies and optimizing workflows.</p>
        <p>However, end users often struggle to specify the correct format of timestamp columns due to variations
in date and time representations, leading to errors in data interpretation.</p>
        <p>Implementing automated approaches to address this challenge is also dificult, as it requires robust
algorithms capable of correctly identifying and parsing a wide range of timestamp formats, accounting
for regional diferences, inconsistencies, and potential ambiguities in the data. For instance, libraries
like dateutil in Python and Java’s java.time package provide built-in functions to parse multiple date
and time formats. However, they fail on non-common formats.</p>
        <p>In our library, we provide the top values of the timestamp column to the LLM, which proposes a
format. The timestamp column is then parsed according to this proposed format. If parsing fails, the
resulting error is given to the LLM, which then proposes an alternative format.</p>
        <p>The proposed approach fails when the timestamp column contains values of mixed types (for example,
some values follow ISO8601, and some follow RFC1123). In that case, a line-by-line parsing needs to be
adopted, which is impractical on LLMs.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Installation</title>
      <p>The library can be installed in Python ≥ 3.9.x executing the command pip install -U
csv-pmllm-parsing. The source code is publicly available in the Git repository https://github.com/
ift-alessandro-berti/csv-pm-llm-parsing. The library can use as backend LLM any advanced LLM
exposing the OpenAI’s API specification.</p>
      <p>To setup the connection to the LLM, the openai_api_url, openai_api_key, and openai_model
should be provided to the methods of the library. Alternatively, the parameters could be set up in the
system environment variables OPENAI_API_URL, OPENAI_API_KEY, and OPENAI_MODEL.</p>
      <p>Some example settings are provided:
• OpenAI’s GPT-4O:
openai_api_url = ’https://api.openai.com/v1’; openai_api_key = ’replace’; openai_model
= ’gpt-4o’;
• Locally run Ollama1:
openai_api_url = ’http://127.0.0.1:11434/v1’; openai_api_key = ’replace’; openai_model
= ’qwen2:7b-instruct-q6_K’;
• DeepInfra’s open-source model2:
openai_api_url = ’https://api.deepinfra.com/v1/openai/’; openai_api_key = ’replace’;
openai_model = ’Qwen/Qwen2-72B-Instruct’;</p>
      <sec id="sec-2-1">
        <title>1https://ollama.com/library/qwen2:7b-instruct-q6_K 2https://deepinfra.com/Qwen/Qwen2-72B-Instruct</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Initial Experiments</title>
      <p>We generated some synthetic datasets that can be used to test the library. In particular:
• https://github.com/fit-alessandro-berti/csv-pm-llm-parsing/tree/main/testfiles/sep_detection
contains some examples of common choices for separator and quote character (i.e., command
and double quote, semicolon and single quote, tab and single quote, . . . ).
• https://github.com/fit-alessandro-berti/csv-pm-llm-parsing/tree/main/testfiles/cid_acti_timest
contains some test cases for the selection of the case identifier, activity, and timestamp columns.
• https://github.com/fit-alessandro-berti/csv-pm-llm-parsing/tree/main/testfiles/timest_format
contains some test cases for the identification of the timestamp format (containing formats such
as ISO8601, UNIX, RFC1123, . . . ).
• https://github.com/fit-alessandro-berti/csv-pm-llm-parsing/tree/main/testfiles/overall contains
more complex test cases to test the overall functioning of the library.</p>
      <sec id="sec-3-1">
        <title>We propose in the folder</title>
        <p>https://github.com/fit-alessandro-berti/csv-pm-llm-parsing/tree/main/examples some examples that
use the library for every one of the provided test cases.</p>
        <p>In general, we found that larger LLMs (such as gpt-4o and Qwen/Qwen2-72B-Instruct) can execute the
tasks satisfactorily, while smaller LLMs (such as qwen2:7b-instruct-q6_K), while still competent, fail on
diferent test cases.</p>
        <p>A video showcasing the library and its functioning on some test cases is available at https://youtu.be/
L21sapgCzbM.
[9] A. Burattin, R. Vigo, A framework for semi-automated process instance discovery from decorative
attributes, in: CIDM 2011 Proceedings, IEEE, 2011, pp. 176–183.
[10] S. Sim, R. A. Sutrisnowati, S. Won, S. Lee, H. Bae, Automatic conversion of event data to event
logs using CNN and event density embedding, IEEE Access 10 (2022) 15994–16009.
[11] K. Toyoda, R. G. K. Ying, A. N. Zhang, P. S. Tan, Identifying the key attributes in an unlabeled
event log for automated process discovery, IEEE Trans. Serv. Comput. 17 (2024) 74–81.
[12] E. G. L. de Murillas, H. A. Reijers, W. M. P. van der Aalst, Case notion discovery and
recommendation: automated event log building on databases, Knowl. Inf. Syst. 62 (2020) 2539–2575. URL:
https://doi.org/10.1007/s10115-019-01430-6.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Wynn</surname>
          </string-name>
          ,
          <string-name>
            <surname>W. M. P. van der Aalst</surname>
            , E. Verbeek,
            <given-names>B. N. D.</given-names>
          </string-name>
          <string-name>
            <surname>Stefano</surname>
          </string-name>
          ,
          <article-title>The IEEE XES standard for process mining: Experiences, adoption</article-title>
          , and revision [society briefs],
          <source>IEEE Comput. Intell. Mag</source>
          .
          <volume>19</volume>
          (
          <year>2024</year>
          )
          <fpage>20</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Berti</surname>
          </string-name>
          , OCEL
          <volume>2</volume>
          .
          <article-title>0 resources - www.ocel-standard</article-title>
          .org,
          <source>CoRR abs/2403</source>
          .
          <year>01982</year>
          (
          <year>2024</year>
          ).
          <source>arXiv:2403</source>
          .
          <year>01982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Berti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <surname>W. M. P. van der Aalst</surname>
          </string-name>
          , Abstractions, scenarios, and
          <article-title>prompt definitions for process mining with llms: A case study</article-title>
          ,
          <source>in: BPM 2023 Workshops</source>
          , volume
          <volume>492</volume>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>427</fpage>
          -
          <lpage>439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Berti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kourani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hafke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yun-Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuster</surname>
          </string-name>
          , Evaluating Large Language Models in Process Mining: Capabilities, Benchmarks,
          <string-name>
            <given-names>Evaluation</given-names>
            <surname>Strategies</surname>
          </string-name>
          , and Future Challenges,
          <source>in: Proceedings of the BPM-DS 2024 Working Conference</source>
          , Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Iwasawa</surname>
          </string-name>
          ,
          <article-title>Large language models are zero-shot reasoners</article-title>
          , in: S. Koyejo,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Oh (Eds.),
          <source>NeurIPS</source>
          <year>2022</year>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J. K.</given-names>
          </string-name>
          et al.,
          <article-title>Language models are few-shot learners</article-title>
          , in: H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
          </string-name>
          , H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems</source>
          <year>2020</year>
          ,
          <article-title>NeurIPS 2020</article-title>
          , December 6-
          <issue>12</issue>
          ,
          <year>2020</year>
          , virtual,
          <year>2020</year>
          . URL: https://proceedings.neurips.cc/paper/2020/ hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Luccioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Viguier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ligozat</surname>
          </string-name>
          ,
          <article-title>Estimating the carbon footprint of bloom, a 176b parameter language model</article-title>
          ,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <volume>253</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>253</lpage>
          :
          <fpage>15</fpage>
          . URL: http://jmlr.org/papers/v24/
          <fpage>23</fpage>
          -
          <lpage>0069</lpage>
          .html.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Andaloussi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Burattin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <article-title>Toward an automated labeling of event log attributes</article-title>
          , in: J.
          <string-name>
            <surname>Gulden</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Reinhartz-Berger</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Guerreiro</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Guédria</surname>
          </string-name>
          , P. Bera (Eds.),
          <source>BPMDS 2018 and EMMSAD</source>
          <year>2018</year>
          ,
          <article-title>Held at CAiSE 2018</article-title>
          , Proceedings, volume
          <volume>318</volume>
          <source>of Lecture Notes in Business Information Processing</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>