Working with Event Data

Event data, usually recorded in so-called logs, are the primary source of data for any process mining project or algorithm. As such, they play a vital role in process mining.

Within pm4py we distinguish between two major event data object types:

  • Event Streams: An event stream represents a single sequence of events, executed in the context of a process. Multiple events at different indices of the event stream potentially refer to the same process instance. The Event Stream object in pm4py represents a finite view of such a stream.
  • Event Logs: A log is a collection of sequences of events. Each sequence in a log represents the execution of a process instance (typically, the events in a sequence share the same case ID). The log object is equivalent to the notion of event logs commonly used in process mining. A small constructive sketch follows this list.
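
To make these notions concrete, the following minimal sketch builds a tiny log programmatically (assuming the pm4py 1.x class layout, where these objects live in pm4py.objects.log.log):

from pm4py.objects.log.log import EventLog, Trace, Event

# build a single trace with two events; "concept:name" at trace level is the case ID
trace = Trace()
trace.attributes["concept:name"] = "case_1"
trace.append(Event({"concept:name": "register request"}))
trace.append(Event({"concept:name": "decide"}))

# a log is, in essence, a list of traces
log = EventLog([trace])
print(len(log), len(log[0]))  # 1 trace, 2 events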

In the remainder of this section, we describe how pm4py supports access and manipulation of event log data through the IEEE XES and CSV formats.

Importing IEEE XES files

IEEE XES is a standard format in which process mining logs are expressed. For further information about the format, please refer to the IEEE website.

The following example code aims to import a log, given a path to a log file.

from pm4py.objects.log.importer.xes import factory as xes_import_factory
log = xes_import_factory.apply("<path_to_xes_file>")

A fully working version of the example script can be found in the pm4py-source project, in the examples/web/import_xes_log.py file.

The IEEE XES log is imported as a log, hence, the events are already grouped in traces. Logs are stored as an extension of the Python list: to access a given trace in the log, it is enough to provide its index in the log. Consider the following examples of how to access the different objects stored in the imported log:

  • log[0] refers to the first trace in the log
    • log[0][0] refers to the first event of the first trace in the log
    • log[0][1] refers to the second event of the first trace in the log
  • log[1] refers to the second trace in the log
    • log[1][0] refers to the first event of the second trace in the log
    • log[1][1] refers to the second event of the second trace in the log
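
For instance (assuming the log has been imported as above):

# the log behaves like a Python list of traces
print(len(log))          # number of traces in the log
first_trace = log[0]     # the first trace
print(len(first_trace))  # number of events in the first trace
print(first_trace[1])    # the second event of the first trace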

The apply method of the xes_import_factory (located in the pm4py.objects.log.importer.xes.factory module) accepts two additional parameters:

  • variant: allows us to select a specific importer variant.
  • parameters: allows us to specify additional parameters to the underlying importer.

Observe that, throughout pm4py, we often use the notion of factories, which contain an apply method that takes some objects as input, along with an optional parameters object and an optional variant object. The parameters object is always a dictionary that contains the parameters in a key-value fashion. The variant is typically a string-valued argument.

Currently, we support two different variants, with corresponding (different) parameters:

  • iterparse: (default; use variant="iterparse" to invoke) uses the iterparse library internally for XML parsing and complies with the IEEE XES standard. Supported parameters:
    • timestamp_sort: (boolean) specifies whether the log should be sorted by timestamp
    • timestamp_key: (string) if timestamp_sort is true, the event-attribute key by which the log is sorted
    • reverse_sort: (boolean) specifies the direction in which the log should be sorted
    • index_trace_indexes: (boolean) specifies whether trace indexes should be added as an event attribute for each event
    • max_no_traces_to_import: (integer) specifies the maximum number of traces to import from the log (in the order in which they occur in the XML file)
  • nonstandard: (use variant="nonstandard" to invoke) a custom implementation that reads the XES file line by line (for improved performance). It does not follow the standard and is able to import traces, simple trace attributes, events, and simple event attributes. Supported parameters:
    • same as iterparse
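
As an illustration of these parameters, the following sketch imports at most 100 traces, sorted by timestamp, using the default iterparse variant (the parameter values are arbitrary):

from pm4py.objects.log.importer.xes import factory as xes_import_factory

# arbitrary example values for the supported parameters
parameters = {"timestamp_sort": True, "timestamp_key": "time:timestamp", "max_no_traces_to_import": 100}
log = xes_import_factory.apply("<path_to_xes_file>", parameters=parameters)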

It is possible to access a specific value of a trace/event attribute. As an example (the "concept:name" key at trace level represents the case ID; at event level, it typically represents the performed activity):

first_trace_concept_name = log[0].attributes["concept:name"]
first_event_first_trace_concept_name = log[0][0]["concept:name"]

The following code iterates over all the traces in the log, printing the case ID and, for each event, the performed activity:

for case_index, case in enumerate(log):
    print("\n case index: %d  case id: %s" % (case_index, case.attributes["concept:name"]))
    for event_index, event in enumerate(case):
        print("event index: %d  event activity: %s" % (event_index, event["concept:name"]))

An excerpt of the resulting output:

 case index: 4  case id: 5
event index: 0  event activity: register request
event index: 1  event activity: examine casually
event index: 2  event activity: check ticket
event index: 3  event activity: decide
event index: 4  event activity: reinitiate request
event index: 5  event activity: check ticket
event index: 6  event activity: examine casually
event index: 7  event activity: decide
event index: 8  event activity: reinitiate request
event index: 9  event activity: examine casually
event index: 10  event activity: check ticket
event index: 11  event activity: decide
event index: 12  event activity: reject request

The following code shows an example of invoking the nonstandard variant, along with the specification of the timestamp_sort parameter:

from pm4py.objects.log.importer.xes import factory as xes_import_factory

parameters = {"timestamp_sort": True}

log = xes_import_factory.apply("<path_to_xes_file>", variant="nonstandard", parameters=parameters)

Exporting IEEE XES files

Exporting takes a log as input and produces an XML file that is saved to disk.

To export a log into a file exportedLog.xes, the following code could be used:

from pm4py.objects.log.exporter.xes import factory as xes_exporter

xes_exporter.export_log(log, "exportedLog.xes")

Importing logs from CSV files

CSV is a tabular format often used to store event streams. Excluding the first row, which contains the headers, each row in a CSV file corresponds to an event. Events in a CSV are not grouped into traces: to group them, a column must be designated as the case ID, and events sharing the same value in that column are grouped into the same case.

Process mining algorithms implemented in pm4py usually take a log as input. The logical steps to obtain a log from a CSV file are:

  • Using Pandas to ingest the CSV file into a dataframe
  • Converting the dataframe into an event stream
  • Converting the event stream into a log, specifying the column corresponding to the case ID

In the following piece of code, the CSV file running-example.csv that can be found in the directory tests/input_data is imported into an event stream structure:

import os
from pm4py.objects.log.importer.csv import factory as csv_importer

event_stream = csv_importer.import_event_stream(os.path.join("tests", "input_data", "running-example.csv"))

The previous code covers both the import of the CSV file through Pandas and its conversion into the event stream structure. Additional parameters can be provided to the import_event_stream method inside a dictionary passed as the optional parameters argument (a usage sketch follows the list):

  • sep expresses the delimiter (comma is the default)
  • quotechar expresses the quote character used in the CSV
  • nrows is a limit to the number of rows that should be read from the CSV file
  • sort expresses if the log should be sorted according to the values of the field specified in sort_field (usually, the timestamp)
  • insert_event_indexes is a boolean value that specifies whether the index of each event should be inserted as an additional event attribute
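
A minimal sketch using some of these parameters (the values here are arbitrary):

from pm4py.objects.log.importer.csv import factory as csv_importer

# arbitrary example values: semicolon-separated file, first 1000 rows only,
# sorted by the timestamp column
parameters = {"sep": ";", "nrows": 1000, "sort": True, "sort_field": "time:timestamp"}
event_stream = csv_importer.import_event_stream("<path_to_csv_file>", parameters=parameters)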

In an event stream structure, events are not grouped in cases, so retrieving the length of an event stream means retrieving the number of events. Moreover, each event in an event stream is saved as a dictionary where the keys are the column names:

event_stream_length = len(event_stream)
print(event_stream_length)
for event in event_stream:
    print(event)

As an example, this is one event of the running-example.csv log:

{'Unnamed: 0': 10, 'Activity': 'check ticket', 'Costs': 100, 'Resource': 'Mike', 'case:concept:name': 2, 'case:creator': 'Fluxicon Nitro', 'concept:name': 'check ticket', 'org:resource': 'Mike', 'time:timestamp': Timestamp('2010-12-30 11:12:00')}

To eventually convert the event stream structure into a log structure (where events are grouped into cases), the case ID column must be identified by the user (in the previous example, the case ID column is case:concept:name). To perform the conversion, the following instructions can be used:

from pm4py.objects.conversion.log import factory as conversion_factory

log = conversion_factory.apply(event_stream)

Sometimes it is useful to ingest the CSV into a dataframe using Pandas, perform some pre-filtering on the dataframe, and only then convert it into an event stream (and then log) structure. The following code covers the ingestion, the conversion into the event stream structure and, eventually, the conversion into a log; a sketch of a pre-filtering step follows the code.

import os
from pm4py.objects.log.adapters.pandas import csv_import_adapter
from pm4py.objects.conversion.log import factory as conversion_factory

dataframe = csv_import_adapter.import_dataframe_from_path(os.path.join("tests", "input_data", "running-example.csv"), sep=",")
log = conversion_factory.apply(dataframe)
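
As a sketch of the pre-filtering step mentioned above (the Activity column name comes from the running example; any Pandas operation could be used here):

# hypothetical pre-filtering: keep only the events of a single activity
filtered_dataframe = dataframe[dataframe["Activity"] == "check ticket"]
filtered_log = conversion_factory.apply(filtered_dataframe)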

Exporting logs to CSV files

Exporting capabilities into CSV files are provided for both event stream and log formats.

The following example covers the export of an event stream into CSV. Here, the event stream structure is converted into a Pandas dataframe, which is then exported to a CSV file:

from pm4py.objects.log.exporter.csv import factory as csv_exporter

csv_exporter.export(event_stream, "outputFile1.csv")

The export of logs into CSV is similar: the log is first converted into an event stream (the case attributes are copied into the events, adding the case: prefix to their keys), then the event stream is converted into a Pandas dataframe, and the dataframe is exported to a CSV file:

from pm4py.objects.log.exporter.csv import factory as csv_exporter

csv_exporter.export(log, "outputFile2.csv")

Importing / Exporting logs from Parquet format

Apache Parquet is a columnar format that is well suited to storing dataframes. Being columnar, it allows applying efficient compression techniques, achieving a small file size even for big logs. Moreover, it is very efficient to load and makes it possible to retrieve only a particular set of columns.

It is a typed format: each column is stored along with its type, so no data conversion is needed at load time. The gains in loading time and required memory are huge in comparison to the XESLite log management in ProM 6.

BPIC 2017 application log (events: 1202267; cases: 31509):

  • Loading time: XESLite 49.9s; Parquet (all columns) 1.93s; Parquet (only case ID and activity columns) 0.33s
  • Memory needed: XESLite 1203MB; Parquet (all columns) 184MB; Parquet (only case ID and activity columns) 28MB

BPIC 2018 log (events: 2514266; cases: 43809):

  • Loading time: XESLite 39.4s; Parquet (all columns) 4.88s; Parquet (only case ID and activity columns) 0.43s
  • Memory needed: XESLite 3219MB; Parquet (all columns) 784MB; Parquet (only case ID and activity columns) 58MB

In pm4py, importing a Parquet file yields a Pandas dataframe as output. Any log object in pm4py can then be stored in a Parquet file (through an automatic conversion to a Pandas dataframe).

To import a Parquet log with all its columns, the following instructions could be used:

import os
from pm4py.objects.log.importer.parquet import factory as parquet_importer

log_path = os.path.join("tests", "input_data", "running-example.parquet")
dataframe = parquet_importer.apply(log_path)

To import a Parquet log with only a set of columns, the following instructions could be used:

dataframe = parquet_importer.apply(log_path, parameters={"columns": ["case:concept:name", "concept:name"]})

To show an example of exporting, we first import a log in XES format:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer

log_path = os.path.join("tests", "input_data", "running-example.xes")
log = xes_importer.apply(log_path)

To export a log into Parquet format, the following instructions could be used:

from pm4py.objects.log.exporter.parquet import factory as parquet_exporter

parquet_exporter.apply(log, "running-example.parquet")

Sorting by timestamp

If an event log or an event stream is not sorted according to the timestamp of its events, it is possible to sort it using the following instructions:

from pm4py.objects.log.util import sorting
log = sorting.sort_timestamp(log)

In the case of event streams, events are simply sorted from the first (according to the timestamp) to the last.

In the case of logs, the events of each trace are first internally sorted by timestamp, and then the traces are sorted according to the timestamp of their first event.
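
A sketch with explicit arguments (it is an assumption about this pm4py version that sort_timestamp also accepts the timestamp key and the sort direction):

from pm4py.objects.log.util import sorting

# assumed keyword arguments: explicit timestamp key, descending order
log = sorting.sort_timestamp(log, timestamp_key="time:timestamp", reverse_sort=True)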

Sorting by lambda expression

If the sorting operation should instead be performed using a lambda expression, it is possible to use the method sorting.sort_lambda, providing the lambda expression as second argument and the direction of the sort as third (if reverse=False, the sort is ascending; if reverse=True, the sort is descending). The following example sorts the log by case ID; a descending example follows the code.

from pm4py.objects.log.util import sorting

sorted_log = sorting.sort_lambda(log, lambda x: x.attributes["concept:name"], reverse=False)
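
As a further illustration, a descending sort by the number of events per trace (any key computable from a trace can be used):

sorted_log_desc = sorting.sort_lambda(log, lambda trace: len(trace), reverse=True)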

Sampling

This operation, which works on log objects, consists of taking only a small subset of traces, in the hope that they reflect the overall behavior of the process.

This is particularly useful if the log size is huge.

The following code keeps only 50 traces of the log:

from pm4py.objects.log.util import sampling
sampled_log = sampling.sample(log, n=50)
