PM4Py Dependencies Licensing

PM4Py, in its current state, is supported by open-source libraries released under MIT/BSD/Apache 2.0 and other permissive licenses. The complete list is found below:

Package Link License
MarkupSafe https://pypi.org/project/MarkupSafe/ BSD 3-Clause
backcall https://pypi.org/project/backcall/ BSD
ciso8601 https://pypi.org/project/ciso8601/ MIT
colorama https://pypi.org/project/colorama/ BSD
cycler https://pypi.org/project/Cycler/ BSD
decorator https://pypi.org/project/decorator/ BSD
graphviz https://pypi.org/project/graphviz/ MIT
ipython https://pypi.org/project/ipython/ BSD
ipython-genutils https://pypi.org/project/ipython_genutils/ BSD
jedi https://pypi.org/project/jedi/ MIT
jinja2 https://pypi.org/project/Jinja2/ BSD
joblib https://pypi.org/project/joblib/ BSD
kiwisolver https://pypi.org/project/kiwisolver/ BSD
lxml https://pypi.org/project/lxml/ BSD
matplotlib https://pypi.org/project/matplotlib/ Python Software Foundation License (PSF, BSD-style)
networkx https://pypi.org/project/networkx/ BSD
numpy https://pypi.org/project/numpy/ BSD
ortools https://pypi.org/project/ortools/ Apache 2.0
pandas https://pypi.org/project/pandas/ BSD
parso https://pypi.org/project/parso/ MIT
pickleshare https://pypi.org/project/pickleshare/ MIT
prompt-toolkit https://pypi.org/project/prompt_toolkit/ BSD 3-Clause
protobuf https://pypi.org/project/protobuf/ BSD 3-Clause
pulp https://github.com/coin-or/pulp/blob/master/LICENSE Custom license (free, MIT style)
pyarrow https://pypi.org/project/pyarrow/ Apache 2.0
pydotplus https://pypi.org/project/pydotplus/ MIT
pygments https://pypi.org/project/Pygments/ BSD
pyparsing https://pypi.org/project/pyparsing/ MIT
python-dateutil https://pypi.org/project/python-dateutil/ Apache, BSD (dual license)
pytz https://pypi.org/project/pytz/ MIT
pyvis https://pypi.org/project/pyvis/ BSD
scikit-learn https://pypi.org/project/scikit-learn/ new BSD
scipy https://pypi.org/project/scipy/ BSD
setuptools https://pypi.org/project/setuptools/ MIT
six https://pypi.org/project/six/ MIT
traitlets https://pypi.org/project/traitlets/ BSD
wcwidth https://pypi.org/project/wcwidth/ MIT


Incremental Calculation of Cycle and Lead Time


Two important KPIs for a process execution are:

  • The Lead Time: the overall time from the start to the end of the process execution, regardless of whether the instance was actively worked on or not.
  • The Cycle Time: the overall time from the start to the end of the process execution, considering only the periods in which the instance was actively worked on.

For these concepts, it is important to consider only business hours (so, excluding nights and weekends). Indeed, during those periods the workforce is at home and the machinery is idle, so the instance cannot be worked on, and the time “wasted” there is not recoverable.

Within ‘interval’ event logs (that have a start and an end timestamp for each event), it is possible to calculate the lead time and the cycle time incrementally (event per event). The lead time and the cycle time reported on the last event of a case are the ones related to the whole process execution. With this, it is easy to understand which activities of the process have caused a bottleneck (e.g. the lead time increases significantly more than the cycle time).

The algorithm implemented in PM4Py starts by sorting each case by the start timestamp (so, activities started earlier are reported earlier in the log), and is able to calculate the lead and cycle time in all situations, including the complex ones reported in the following picture:

In the following, we aim to insert the following attributes into the events of a log:

  • @@approx_bh_partial_cycle_time => incremental cycle time associated with the event (the cycle time of the last event is the cycle time of the process execution)
  • @@approx_bh_partial_lead_time => incremental lead time associated with the event
  • @@approx_bh_overall_wasted_time => difference between the partial lead time and the partial cycle time values
  • @@approx_bh_this_wasted_time => wasted time ONLY with regard to the activity described by the ‘interval’ event
  • @@approx_bh_ratio_cycle_lead_time => measures the incremental Flow Rate (between 0 and 1).

The method that calculates the lead and the cycle time accepts the following optional parameters:

  • worktiming: the work timing (e.g. [7, 17])
  • weekends: the specification of the weekends (e.g. [6, 7])

and can be applied with the following lines of code:

from pm4py.objects.log.util import interval_lifecycle

# enrich each event of the (interval) log with the lead/cycle time attributes
enriched_log = interval_lifecycle.assign_lead_cycle_time(log)

With this, we obtain an enriched log that contains, for each event, the corresponding lead/cycle time attributes.
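As a minimal sketch, the optional parameters can be provided through the parameters dictionary, and the cycle/lead time of a process execution can then be read from the last event of a case (the parameter keys worktiming and weekends are assumptions here, named after the parameters listed above):

from pm4py.objects.log.util import interval_lifecycle

# worktiming/weekends are assumed parameter keys, mirroring the parameter names above:
# work shifts from 7 to 17, weekends on Saturday (6) and Sunday (7)
enriched_log = interval_lifecycle.assign_lead_cycle_time(log, parameters={"worktiming": [7, 17], "weekends": [6, 7]})

# the lead/cycle time values on the last event of a case refer to the whole process execution
last_event = enriched_log[0][-1]
print(last_event["@@approx_bh_partial_lead_time"])
print(last_event["@@approx_bh_partial_cycle_time"])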

Filtering on activity durations

Having a log enriched with interval and duration information, it is possible to identify the process instances where the actual cycle time (COMPLETE – START) of an activity falls in a given range.

To do so, it is possible to use the numeric attribute filter. Suppose we start with the BPI Challenge 2012 log (for speed, only the first 50 cases are actually imported):

from pm4py.objects.log.importer.xes import factory as xes_importer

# use the line-by-line ("nonstandard") importer and keep only the first 50 traces
log = xes_importer.apply("bpic2012.xes", variant="nonstandard", parameters={"max_no_traces_to_import": 50})

and use the log utility to transform it into an “interval” event log:

from pm4py.objects.log.util import interval_lifecycle

enriched_log = interval_lifecycle.assign_lead_cycle_time(log)

Then, suppose we want to filter the cases where the activity W_Nabellen offertes (from START to COMPLETE) has had a duration between 300.0 and 1000.0 seconds. The numeric attribute filter can be used:

from copy import deepcopy
from pm4py.util import constants
from pm4py.algo.filtering.log.attributes import attributes_filter

With the following specification (here, the numeric attribute filter is applied to the @@duration attribute, while an additional filter requires the concept:name attribute to be equal to W_Nabellen offertes):

filtered_log = attributes_filter.apply_numeric(
    deepcopy(enriched_log), 300.0, 1000.0,
    parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "@@duration",
                "stream_filter_key1": "concept:name",
                "stream_filter_value1": "W_Nabellen offertes"})

Suppose we want to impose the further restriction that the activity W_Nabellen offertes is done by the resource 10913; then the following code can be used:

filtered_log2 = attributes_filter.apply_numeric(
    deepcopy(enriched_log), 300.0, 1000.0,
    parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "@@duration",
                "stream_filter_key1": "concept:name",
                "stream_filter_value1": "W_Nabellen offertes",
                "stream_filter_key2": "org:resource",
                "stream_filter_value2": "10913"})


Business Hours


Given an interval event log (an EventLog object where each event is characterised by two timestamps: a start timestamp, usually contained in the start_timestamp attribute, and a completion timestamp, usually contained in the time:timestamp attribute), the “duration” of the event is the difference between the completion timestamp and the start timestamp. This difference may be inflated by nights (where an activity is not actively worked on), weekends (where the workers may not be at the workplace) and other kinds of pauses. In PM4Py, a way is provided to consider only the time in which the activity could actually be worked on (so, excluding the time outside of the working hours and the weekends).

Given a start and an end timestamp (expressed as UNIX timestamps), the business hours calculation method can be called as follows:

from pm4py.util.business_hours import BusinessHours
from datetime import datetime

# interval endpoints, built from UNIX timestamps
st = datetime.fromtimestamp(100000000)
et = datetime.fromtimestamp(200000000)

# the default work timing and weekends apply here
bh_object = BusinessHours(st, et)
worked_time = bh_object.getseconds()
print(worked_time)

Obtaining 29736000 seconds for this specific example. This value is plausible: the interval spans roughly 1157 days, of which about 5/7 are working days, each contributing roughly 10 working hours, i.e. about 29.7 million seconds in total.

To provide specific shifts and weekends (for example, always-short weeks with 4 working days 🙂 and work shifts from 10 to 16), the following code can be used:

bh_object = BusinessHours(st, et, worktiming=[10, 16], weekends=[5, 6, 7])
worked_time = bh_object.getseconds()
print(worked_time)

The business hours duration method is called automatically in the following parts of PM4Py:

  • Conversion of a lifecycle log to an interval log (in the case the business hours are explicitly required), with optional provision of the worktiming and weekends parameters if they differ from the standard values (a sketch follows this list).
  • Calculation of the process cycle and lead time, with optional provision of the worktiming and weekends parameters if they differ from the standard values.
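
A minimal sketch of the first point, assuming that the conversion utility is available as to_interval in the same module and that it accepts the business hours option through its parameters dictionary (the keys business_hours, worktiming and weekends are assumptions, named after the parameters above):

from pm4py.objects.log.util import interval_lifecycle

# convert a lifecycle log to an interval log, explicitly requiring business hours;
# the parameter keys below are assumed, not confirmed by this page
interval_log = interval_lifecycle.to_interval(log, parameters={"business_hours": True, "worktiming": [10, 16], "weekends": [5, 6, 7]})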

Calculation of the time passed from the previous activity

This calculation is simply based on the direct succession of events in a case, and does not take into account the concurrency between the activities.

A utility has been provided in PM4Py to insert, in each event of the log, the time passed from the previous event (attribute @@passed_time_from_previous) and the approximate business time passed from the previous event (attribute @@approx_bh_passed_time_from_previous).

In the following example, we apply the utility. We start by importing the receipt.xes log:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
log = xes_importer.apply(os.path.join("tests", "input_data", "receipt.xes"))

Then, the time from the previous event is inserted into the log (inside the attributes mentioned above):

from pm4py.objects.log.util import time_from_previous
log = time_from_previous.insert_time_from_previous(log)
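
As a quick check (a minimal sketch; the attribute names are the ones described above), the inserted values can be inspected directly on an event:

# inspect the attributes inserted on the second event of the first case
print(log[0][1]["@@passed_time_from_previous"])
print(log[0][1]["@@approx_bh_passed_time_from_previous"])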

Then, the log can be exported:

from pm4py.objects.log.exporter.xes import factory as xes_exporter
xes_exporter.apply(log, "receipt.xes")

If the exported log is opened, you can see that the insertion has been successful.


Benchmarks


Having benchmarks, whether they are favourable or not, is important for us. On this page, we attach some of the benchmarks that have been done on PM4Py in comparison to other Process Mining tools.

Importing a CSV and calculating the frequency DFG

The most widely used format for data extraction from databases is CSV. In this section, we measure the time needed to load a CSV file and to calculate the frequency Directly-Follows Graph (DFG) on it.

The script that has been used in PM4Py to measure the CSV import times and calculate the DFG is the following (tests have been done on an i7-7550U with 16 GB of DDR4 RAM):

from pm4py.objects.log.adapters.pandas import csv_import_adapter
from pm4py.algo.discovery.dfg.adapters.pandas import df_statistics
import time

aa = time.time()
# import the CSV without converting the timestamp columns
df = csv_import_adapter.import_dataframe_from_path_wo_timeconversion("C:\\csv_logs\\Billing.csv")
# uncomment the following lines to convert the timestamp columns and sort the dataframe
#df = csv_import_adapter.convert_timestamp_columns_in_df(df, timest_columns=["time:timestamp"], timest_format="%Y-%m-%d %H:%M:%S")
#df = df.sort_values(["case:concept:name", "time:timestamp"])
bb = time.time()
# calculate the frequency DFG on the dataframe
dfg = df_statistics.get_dfg_graph(df, measure="frequency", sort_caseid_required=False)
cc = time.time()
# import time and DFG calculation time, respectively
print(bb-aa)
print(cc-bb)


The following results are obtained in comparison to the CSV importer included in ProM 6:

When the timestamp columns are converted and a sort operation is done, the Pandas CSV importer performs in some cases better than and in some cases equal to the ProM 6 CSV importer. When no sort is applied, the Pandas import is much faster.

When calculating the DFG, in comparison to the analogous plug-in in ProM 6 (which can be found inside the Inductive Miner), the results are the following (tests have been done on an i7-7550U with 16 GB of DDR4 RAM):

So, the Pandas approach in PM4Py seems to work very well in comparison to the ProM 6 implementation.

Also when comparing PM4Py to a leading commercial software in this context, better performance is obtained:

Importing an XES file

ProM offers a big choice of plug-ins that are able to import XES files. Choosing one to refer to is difficult but, since RAM is not a problem, the Naive importer has been chosen.

PM4Py, on the other hand, offers two different XES importers:

  • A standard, certified, importer that relies on the standard LXML library and is able to handle any sort of XML file.
  • A non-standard, non-certified, importer that is a “line parser” and is able to parse only pretty-printed XML files (both variants are invoked in the sketch after this list).
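
A minimal sketch of invoking the two importers, reusing the variant parameter shown earlier on this page (the file name receipt.xes is only a placeholder):

from pm4py.objects.log.importer.xes import factory as xes_importer

# standard importer (LXML-based)
log = xes_importer.apply("receipt.xes")

# non-standard line parser (requires a pretty-printed XES file)
log = xes_importer.apply("receipt.xes", variant="nonstandard")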

The results of the comparison are the following:

As can be seen in the table, PM4Py cannot handle XES at the same speed as ProM. The non-standard importer fares better, but fails on both BPI Challenge 2017 logs.

Applying Process Discovery techniques

In PM4Py, we offer two process discovery techniques: the Alpha Miner and an implementation of the Inductive Miner Directly-Follows.

The most time-consuming part of applying the two algorithms on big logs consists in retrieving the basic structures (e.g. the causal relations for the Alpha Miner, the DFG for the Inductive Miner).
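
For reference, a minimal sketch of applying the two algorithms, following the factory-style import pattern used elsewhere on this page (the module paths are assumed to mirror that pattern):

from pm4py.algo.discovery.alpha import factory as alpha_miner
from pm4py.algo.discovery.inductive import factory as inductive_miner

# both return a Petri net together with its initial and final marking
net1, im1, fm1 = alpha_miner.apply(log)
net2, im2, fm2 = inductive_miner.apply(log)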

A direct comparison can be made on the execution speed of the Alpha Miner on some big logs. Here, PM4Py loses the speed competition to the ProM implementation:

Applying Conformance Checking techniques

In PM4Py, two different conformance checking techniques are provided:

  • Token-based replay
  • Alignments

The ProM6 Process Mining framework offers the alignments technique (through the “Conformance Analysis” plug-in).

Token-based replay is a very fast technique, and outperforms the competing implementation in ProM6. Here, the times are obtained using the log and a model obtained by the IMDFA implementation in PM4Py:
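
A minimal sketch of running the token-based replay, assuming a model (a Petri net with its initial and final marking) discovered beforehand and the factory-style API pattern used elsewhere on this page:

from pm4py.algo.conformance.tokenreplay import factory as token_replay

# replay every trace of the log on the model; one diagnostic result per trace
replay_results = token_replay.apply(log, net, initial_marking, final_marking)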

Unfortunately, alignments in PM4Py are not (yet) on a par with the implementation provided in ProM6.

Applying filtering

Filtering is a very important operation in Process Mining because it helps to restrict the log to the information that is really useful.

We have filtering at three different levels:

  • Event level (keeping only the events in the log that satisfy a particular pattern)
  • Case level (keeping only the cases in the log whose events satisfy a particular pattern)
  • Variant level (e.g. keeping only the cases where a given set of activities occurs)

The speed comparison has been made between the Pandas dataframe implementation in PM4Py and the XLog implementation in ProM6.

In the following table, the speed of filtering the log on its most frequent activity, obtaining another log as output, is measured in ProM6 and in PM4Py:

In the following table, the speed of retrieving the variants from the log is measured in ProM6 and in PM4Py:
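
As a minimal sketch of the two measured operations on the Pandas side, plain Pandas operations can be used instead of a specific PM4Py filter (assuming a dataframe df with the usual case:concept:name and concept:name columns, already sorted by case and timestamp):

# event-level filtering: keep only the events of the most frequent activity
most_frequent_activity = df["concept:name"].value_counts().idxmax()
filtered_df = df[df["concept:name"] == most_frequent_activity]

# variants retrieval: the sequence of activities of each case, with frequencies
variants = df.groupby("case:concept:name")["concept:name"].agg(",".join)
print(variants.value_counts().head())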
