Working with Event Data

Event data, usually recorded in so-called event logs, are the primary data source of any process mining project or algorithm. As such, they play a vital role in process mining.

Within pm4py we distinguish between two major event data object types:

  • Event Logs: An event log represents a single sequence of events, executed in the context of a process. Multiple events at different indices of the event log potentially refer to the same process instance. The Event Log object in pm4py resembles the finite version of an event stream.
  • Trace Logs: A trace log is a collection of sequences of events. Each sequence in a trace log represents the execution of a process instance (typically these events describe the same case id). The trace log object is equivalent to the notion of event logs, commonly used in process mining.
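Conceptually, the two structures can be sketched in plain Python (this is only an illustration of their shape; pm4py wraps them in dedicated classes):

```python
# Event log: a single flat sequence of events; events of different
# cases are interleaved and linked only by a case identifier attribute.
event_log = [
    {"case:concept:name": "1", "concept:name": "register request"},
    {"case:concept:name": "2", "concept:name": "register request"},
    {"case:concept:name": "1", "concept:name": "check ticket"},
]

# Trace log: a collection of traces; each trace groups the events of
# one process instance (case) in order.
trace_log = [
    [{"concept:name": "register request"}, {"concept:name": "check ticket"}],  # case 1
    [{"concept:name": "register request"}],                                    # case 2
]
```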

In the remainder of this section, we describe how pm4py supports access and manipulation of event log data through the IEEE XES and CSV formats.

Importing IEEE XES files

IEEE XES is a standard format in which process mining logs are expressed. For further information about the format, please consult the IEEE website.

The following example code aims to import a log, given a path to a log file.

from pm4py.objects.log.importer.xes import factory as xes_import_factory
log = xes_import_factory.apply("<path_to_xes_file>")

A fully working version of the example script can be found in the pm4py-source project, under examples/web/.

The IEEE XES log is imported as a trace log, hence, the events are already grouped in traces. Trace logs are stored as an extension of the Python list: to access a given trace in the log, it is enough to provide its index in the log. Consider the following examples of how to access the different objects stored in the imported trace log:

  • log[0] refers to the first trace in the log
    • log[0][0] refers to the first event of the first trace in the log
      • log[0][1] refers to the second event of the first trace in the log
  • log[1] refers to the second trace in the log
    • log[1][0] refers to the first event of the second trace in the log
    • log[1][1] refers to the second event of the second trace in the log

The apply method of the xes_import_factory, located in the pm4py.objects.log.importer.xes.factory module, accepts two additional parameters:

  • variant: allows us to select a specific importer variant.
  • parameters: allows us to specify additional parameters to the underlying importer.

Observe that throughout pm4py, we often use the notion of factories, which contain an apply method that takes some objects as input, an optional parameters object and an optional variant object. The parameters object is always a dictionary that contains the parameters in a key-value fashion. The variant is typically a string-valued argument.
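The factory convention can be sketched with a minimal, hypothetical example (simplified names, not the actual pm4py implementation; real factories dispatch to importer modules):

```python
# Minimal sketch of the factory convention used throughout pm4py.
# Each variant is a concrete implementation registered under a string key.

def _import_iterparse(path, parameters):
    return "parsed %s with iterparse, parameters=%s" % (path, parameters)

def _import_nonstandard(path, parameters):
    return "parsed %s line-by-line, parameters=%s" % (path, parameters)

VERSIONS = {"iterparse": _import_iterparse, "nonstandard": _import_nonstandard}

def apply(path, parameters=None, variant="iterparse"):
    # parameters is always a key-value dictionary; variant selects the
    # concrete implementation by its string name
    if parameters is None:
        parameters = {}
    return VERSIONS[variant](path, parameters)
```

A call like apply("log.xes", variant="nonstandard") then routes to the line-by-line implementation.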

Currently, we support two different variants, with corresponding (different) parameters:

  • iterparse: (default, use variant="iterparse" to invoke) uses the iterparse library internally for xml parsing. Complies with the IEEE XES standard. Supported parameters:
    • timestamp_sort: (boolean) Specify if we should sort log by timestamp
    • timestamp_key: (string) If timestamp_sort is true, then sort the log by using this event-attribute key
    • reverse_sort: (boolean) Specify in which direction the log should be sorted
    • index_trace_indexes: (boolean) Specify if trace indexes should be added as event attribute for each event
    • max_no_traces_to_import: (integer) Specify the maximum number of traces to import from the log (as occurring in order of the XML file)
  • nonstandard: (use variant="nonstandard" to invoke) custom implementation that reads the XES file in a line-by-line manner (for improved performance). It does not follow the standard and is able to import traces, simple trace attributes, events, and simple event attributes. Supported parameters:
    • same as iterparse

It is possible to access a specific value of a trace/event attribute, for example (the "concept:name" key at trace level represents the case ID; at event level it typically represents the performed activity):

first_trace_concept_name = log[0].attributes["concept:name"]
first_event_first_trace_concept_name = log[0][0]["concept:name"]

The following code iterates over all the traces in the log writing the case id and, for each event, the performed activity:

for case_index, case in enumerate(log):
    print("\n case index: %d  case id: %s" % (case_index, case.attributes["concept:name"]))
    for event_index, event in enumerate(case):
        print("event index: %d  event activity: %s" % (event_index, event["concept:name"]))
Part of the obtained output is reported below:

 case index: 4  case id: 5
event index: 0  event activity: register request
event index: 1  event activity: examine casually
event index: 2  event activity: check ticket
event index: 3  event activity: decide
event index: 4  event activity: reinitiate request
event index: 5  event activity: check ticket
event index: 6  event activity: examine casually
event index: 7  event activity: decide
event index: 8  event activity: reinitiate request
event index: 9  event activity: examine casually
event index: 10  event activity: check ticket
event index: 11  event activity: decide
event index: 12  event activity: reject request

An example of invoking the non-standard variant, along with the specification of the timestamp_sort parameter, is contained in the following code:

import os
from pm4py.objects.log.importer.xes import factory as xes_import_factory

parameters = {"timestamp_sort": True}

log = xes_import_factory.apply("<path_to_xes_file>", variant="nonstandard", parameters=parameters)

Exporting IEEE XES files

Exporting takes a trace log as input and produces an XML representation that is saved to a file.

To export a trace log into a file exportedLog.xes, the following code could be used:

from pm4py.objects.log.exporter.xes import factory as xes_exporter

xes_exporter.export_log(log, "exportedLog.xes")

Importing logs from CSV files

CSV is a tabular format often used to store event logs. Excluding the first row, which describes the headers, each row in the CSV file corresponds to an event. Events in a CSV are not grouped in traces: a grouping is obtained by specifying a column as case ID; events that share the same value in that column are then grouped into the same case.
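The grouping step can be sketched in plain Python (illustrative only; pm4py performs this internally during the conversion described in this section):

```python
from collections import OrderedDict

# Illustrative sketch: group a flat list of CSV rows (events) into cases
# by the value of a chosen case ID column.
rows = [
    {"case_id": "1", "activity": "register request"},
    {"case_id": "2", "activity": "register request"},
    {"case_id": "1", "activity": "check ticket"},
]

cases = OrderedDict()
for row in rows:
    # events sharing the same case_id end up in the same list,
    # preserving their original order
    cases.setdefault(row["case_id"], []).append(row)
```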

Process Mining algorithms implemented in pm4py usually take a trace log as input. The logical steps in order to get a trace log from a CSV file are:

  • Using Pandas to ingest the CSV file into a Dataframe
  • Converting Dataframe structure into the event log structure
  • Converting the event log into a trace log specifying the column corresponding to the case ID

In the following piece of code, the CSV file running-example.csv that can be found in the directory tests/input_data is imported into an event log structure:

import os
from pm4py.objects.log.importer.csv import factory as csv_importer

event_log = csv_importer.import_log(os.path.join("tests", "input_data", "running-example.csv"))

The previous code covers both the importing of the CSV through Pandas and its conversion into the event log structure. Additional parameters for the import_log method can be provided inside a dictionary passed as the optional parameters argument:

  • sep expresses the delimiter (comma is the default)
  • quotechar expresses the quote character used in the CSV
  • nrows is a limit to the number of rows that should be read from the CSV file
  • sort expresses if the log should be sorted according to the values of the field specified in sort_field (usually, the timestamp)
  • insert_event_indexes is a boolean value that tells if an additional attribute should be inserted in the events

In an event log structure, events are not grouped in cases, so retrieving the length of an event log means retrieving the number of events. Moreover, each event in an event log is saved as a dictionary where the keys are the column names:

event_log_length = len(event_log)
for event in event_log:
    print(event)

In particular, this is an event of the running-example.csv log:

{'Unnamed: 0': 10, 'Activity': 'check ticket', 'Costs': 100, 'Resource': 'Mike', 'case:concept:name': 2, 'case:creator': 'Fluxicon Nitro', 'concept:name': 'check ticket', 'org:resource': 'Mike', 'time:timestamp': Timestamp('2010-12-30 11:12:00')}

To eventually convert the event log structure into a trace log structure (where events are grouped in cases), the case ID column must be identified by the user (in the previous example, the case ID column is case:concept:name). To operate the conversion, the following instructions could be provided:

from pm4py.objects.log import transform

trace_log = transform.transform_event_log_to_trace_log(event_log, case_glue="case:concept:name")

Sometimes it is useful to ingest the CSV into a dataframe using Pandas, perform some pre-filtering on the dataframe, and only afterwards convert it into an event log (and then a trace log) structure. The following code covers the ingestion, the conversion into the event log structure and, eventually, the conversion into a trace log.

import os
from pm4py.objects.log.adapters.pandas import csv_import_adapter
from pm4py.objects.log.importer.csv.versions import pandas_df_imp
from pm4py.objects.log import transform

dataframe = csv_import_adapter.import_dataframe_from_path(os.path.join("tests", "input_data", "running-example.csv"), sep=",")
event_log = pandas_df_imp.convert_dataframe_to_event_log(dataframe)
trace_log = transform.transform_event_log_to_trace_log(event_log, case_glue="case:concept:name")

Exporting logs to CSV files

Exporting capabilities into CSV files are provided for both event log and trace log formats.

The following example covers exporting of event logs into CSV. Hereby, the event log structure is converted into a Pandas dataframe, which is then exported to a CSV file:

from pm4py.objects.log.exporter.csv import factory as csv_exporter

csv_exporter.export_log(event_log, "outputFile1.csv")

The exporting of trace logs into CSV is a similar matter. The trace log is converted into an event log (the case attributes are reported into the events by adding the case: prefix to them), then the event log structure is converted into a Pandas dataframe and the dataframe is exported to a CSV file:

from pm4py.objects.log.exporter.csv import factory as csv_exporter

csv_exporter.export_log(trace_log, "outputFile2.csv")
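The flattening of case attributes mentioned above can be sketched in plain Python (illustrative only, not the actual pm4py implementation):

```python
# Illustrative sketch: flatten a trace log into an event log, copying
# each trace-level attribute onto its events with the "case:" prefix.
trace_log = [
    {"attributes": {"concept:name": "1"},
     "events": [{"concept:name": "register request"},
                {"concept:name": "check ticket"}]},
]

event_log = []
for trace in trace_log:
    for event in trace["events"]:
        flat = dict(event)
        for key, value in trace["attributes"].items():
            flat["case:" + key] = value  # e.g. "case:concept:name"
        event_log.append(flat)
```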

Sorting by timestamp

If an event log or a trace log object is not sorted according to the timestamp of its events, then it is possible to sort the log using the following instructions:

from pm4py.objects.log.util import sorting
log = sorting.sort_timestamp(log)

In the case of event logs, events are simply sorted from the first (according to the timestamp) to the last.

In the case of trace logs, traces are first internally sorted by the timestamp of their events, and then they are sorted according to the timestamp of their first event.
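The two-step sorting of a trace log can be sketched in plain Python (integers stand in for timestamps; illustrative only):

```python
# Illustrative sketch of trace log sorting: first sort the events inside
# each trace by timestamp, then sort the traces by the timestamp of
# their first event.
trace_log = [
    [{"time:timestamp": 5}, {"time:timestamp": 3}],
    [{"time:timestamp": 1}, {"time:timestamp": 2}],
]

for trace in trace_log:
    trace.sort(key=lambda event: event["time:timestamp"])

trace_log.sort(key=lambda trace: trace[0]["time:timestamp"])
```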

Sorting by lambda expression

If the sorting operation should instead be done using a lambda expression, then it is possible to use the method sorting.sort_lambda, providing the lambda expression as second argument, and the direction of sort (if reverse=False, then the sort is ascending; if reverse=True, then the sort is descending). The following example sorts the log on the case ID.

from pm4py.objects.log.util import sorting

sorted_log = sorting.sort_lambda(log, lambda x: x.attributes["concept:name"], reverse=False)


Sampling traces

This operation, which works on trace log objects, consists of taking only a small subset of traces, in the hope that they reflect the overall behavior of the process.

This is particularly useful if the log size is huge.

The following code keeps only 50 traces of the log:

from pm4py.objects.log.util import sampling
sampled_log = sampling.sample(log, n=50)
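Conceptually, sampling amounts to a random selection of traces, which can be sketched in plain Python (illustrative only, not the actual pm4py implementation):

```python
import random

# Illustrative sketch: keep at most n randomly chosen traces of a trace log.
trace_log = [["event-a"], ["event-b"], ["event-c"], ["event-d"]]

n = 2
sampled = random.sample(trace_log, min(n, len(trace_log)))
```

The min() guard ensures the sketch also works when the log contains fewer than n traces.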