Benchmarks

Benchmarks, whether favourable or not, are important to us. This page collects some of the benchmarks that compare PM4Py with other Process Mining tools.

Importing a CSV and calculating the frequency DFG

CSV is the most widely used format for data extracted from databases. This section measures the time needed to load a CSV file and to calculate the frequency Directly-Follows Graph (DFG) on it.

The following script has been used in PM4Py to measure the CSV import time and the DFG calculation time (tests have been done on an I7-7550U with 16 GB of DDR4 RAM):

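import time

from pm4py.objects.log.adapters.pandas import csv_import_adapter
from pm4py.algo.discovery.dfg.adapters.pandas import df_statistics

# Time the CSV import (no timestamp conversion, no sorting).
start = time.time()
df = csv_import_adapter.import_dataframe_from_path_wo_timeconversion("C:\\csv_logs\\Billing.csv")
# Uncomment the next two lines to also convert the timestamp column and sort the events.
#df = csv_import_adapter.convert_timestamp_columns_in_df(df, timest_columns=["time:timestamp"], timest_format="%Y-%m-%d %H:%M:%S")
#df = df.sort_values(["case:concept:name", "time:timestamp"])
imported = time.time()
# Time the calculation of the frequency DFG on the dataframe.
dfg = df_statistics.get_dfg_graph(df, measure="frequency", sort_caseid_required=False)
done = time.time()
print(imported - start)  # seconds spent importing the CSV
print(done - imported)  # seconds spent calculating the DFG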

 

The results, in comparison to the CSV importer included in ProM 6, are the following:

When timestamp columns are converted and a sort operation is done, the Pandas CSV importer performs better than the ProM6 CSV importer in some cases and on par with it in the others. When no sort is applied, the Pandas import is much faster.
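To make the two configurations concrete, here is a minimal sketch in plain Pandas of the fast path (strings left as they are, file order kept) and the slower path (timestamp conversion followed by a sort). The tiny in-memory CSV is invented for illustration, and the code calls Pandas directly rather than the PM4Py adapter used in the benchmark.

```python
import io

import pandas as pd

# Hypothetical three-event log, deliberately not sorted by case or time.
CSV_TEXT = """case:concept:name,concept:name,time:timestamp
c2,register,2019-01-02 09:00:00
c1,pay,2019-01-01 11:30:00
c1,register,2019-01-01 10:00:00
"""

# Fast path: timestamps stay as plain strings and rows keep the file order.
raw = pd.read_csv(io.StringIO(CSV_TEXT))

# Slower path: parse the timestamp column, then sort by case and time.
converted = raw.copy()
converted["time:timestamp"] = pd.to_datetime(
    converted["time:timestamp"], format="%Y-%m-%d %H:%M:%S"
)
converted = converted.sort_values(
    ["case:concept:name", "time:timestamp"]
).reset_index(drop=True)

print(converted)
```

The conversion and the sort both touch every row, which is consistent with the import being much faster when they are skipped.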

When calculating the DFG, the results in comparison to the analogous plug-in in ProM 6, which can be found inside the Inductive Miner, are the following (tests have been done on an I7-7550U with 16 GB of DDR4 RAM):

So the Pandas approach in PM4Py appears to work very well in comparison to the ProM 6 implementation.
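For readers unfamiliar with the structure being measured, the following is a minimal sketch of how a frequency DFG can be computed on a dataframe with vectorised Pandas operations. It is not the code behind `df_statistics.get_dfg_graph`; the function name `frequency_dfg` and the sample events are assumptions made for illustration.

```python
from collections import Counter

import pandas as pd


def frequency_dfg(df, case_col="case:concept:name",
                  act_col="concept:name", ts_col="time:timestamp"):
    """Count how often one activity is directly followed by another in a case."""
    ordered = df.sort_values([case_col, ts_col])
    # Next activity within the same case; missing for the last event of a case.
    next_act = ordered.groupby(case_col)[act_col].shift(-1)
    pairs = pd.DataFrame({"source": ordered[act_col], "target": next_act}).dropna()
    return Counter(zip(pairs["source"], pairs["target"]))


# Hypothetical sample log with two cases.
events = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c1", "c2", "c2"],
    "concept:name": ["register", "check", "pay", "register", "pay"],
    "time:timestamp": pd.to_datetime([
        "2019-01-01 10:00:00", "2019-01-01 11:00:00", "2019-01-01 12:00:00",
        "2019-01-02 09:00:00", "2019-01-02 09:30:00",
    ]),
})

dfg = frequency_dfg(events)
print(dfg)
```

The whole computation is a sort, a grouped shift and a count, with no Python-level loop over events, which is the kind of operation Pandas executes efficiently.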

PM4Py also obtains better performance than a leading commercial software in this context:

Importing an XES file

ProM offers a wide choice of plug-ins that can import XES files. Choosing one as a reference is difficult, but since RAM is not a problem, the Naive importer has been chosen.

PM4Py, on the other hand, offers two different XES importers:

  • A standard, certified, importer that relies on the standard LXML library and is able to handle any sort of XML file.
  • A non-standard, non-certified, importer that is a “line parser” and is able to parse only pretty-printed XML files.

The results of the comparison are the following:

As can be seen in the table, PM4Py cannot handle XES at the same speed as ProM. The non-standard importer performs better, but fails on both BPI Challenge 2017 logs.
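To illustrate what a full XML parser has to do for each event, here is a minimal sketch of streaming XES parsing. It uses `xml.etree.ElementTree.iterparse` from the Python standard library as a stand-in, so it is neither of the two PM4Py importers; the helper names and the embedded sample log are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://www.xes-standard.org/}'."""
    return tag.rsplit("}", 1)[-1]


def read_traces(source):
    """Yield one list of activity names per trace, streaming through the file."""
    activities = []
    for _, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "event":
            # Only attributes nested directly in the event are inspected here.
            for attr in elem:
                if attr.get("key") == "concept:name":
                    activities.append(attr.get("value"))
        elif name == "trace":
            yield activities
            activities = []
            elem.clear()  # release the finished trace to keep memory flat


# Hypothetical two-trace log, small enough to embed.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<log xes.version="1.0" xmlns="http://www.xes-standard.org/">
  <trace>
    <string key="concept:name" value="case-1"/>
    <event><string key="concept:name" value="register"/></event>
    <event><string key="concept:name" value="pay"/></event>
  </trace>
  <trace>
    <string key="concept:name" value="case-2"/>
    <event><string key="concept:name" value="register"/></event>
  </trace>
</log>"""

traces = list(read_traces(io.BytesIO(SAMPLE)))
print(traces)
```

A parser of this kind accepts any well-formed layout, whereas a line parser gains speed by assuming one element per line, which is why it only works on pretty-printed files.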

Applying Process Discovery techniques

In PM4Py, we offer two process discovery techniques, namely the Alpha Miner and an implementation of the Inductive Miner Directly-Follows.

The most time-consuming part of applying the two algorithms to big logs consists in retrieving the basic structure (e.g. causal relations for the Alpha Miner, the DFG for the Inductive Miner).

A direct comparison can be made on the execution speed of the Alpha Miner on some big logs. Here, PM4Py is slower than the ProM implementation:
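As an illustration of the basic structure for the Alpha Miner, the sketch below derives the causal relations from a DFG using the textbook definition: a causes b when a is directly followed by b but b is never directly followed by a. The function name and the sample DFG are hypothetical, and this is not taken from the PM4Py source.

```python
def causal_relations(dfg):
    """Return the pairs (a, b) present in the DFG whose reverse (b, a) is absent."""
    return {(a, b) for (a, b) in dfg if (b, a) not in dfg}


# Hypothetical DFG: keys are (source, target) pairs, values are frequencies.
dfg = {("a", "b"): 3, ("b", "c"): 2, ("c", "b"): 1, ("b", "d"): 2}

causal = causal_relations(dfg)
print(sorted(causal))
```

Pairs that occur in both directions, such as b and c here, are treated as parallel and excluded. Once the DFG is available this step is cheap, so the cost is dominated by the pass over the log that builds the DFG.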

Applying Conformance Checking techniques

In PM4Py, two different conformance checking techniques are provided:

  • Token-based replay
  • Alignments

The ProM6 Process Mining framework offers the alignments technique (through the “Conformance Analysis” plug-in).

Token-based replay is a very fast technique, and outperforms the competition from ProM6. The times reported here are obtained by replaying the log on a model discovered by the IMDFA implementation in PM4Py.

Unfortunately, alignments in PM4Py are not yet on par with the implementation provided in ProM6.
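To give an idea of why token-based replay is fast, here is a minimal sketch of the technique on a Petri net whose transitions are all visible and uniquely labelled. It counts produced, consumed, missing and remaining tokens and combines them with the usual fitness formula 0.5 * (1 - missing / consumed) + 0.5 * (1 - remaining / produced). The net encoding, the function name and the example net are assumptions made for illustration; PM4Py's implementation handles more general models.

```python
from collections import Counter


def token_replay_fitness(trace, net, initial, final):
    """Replay one trace and return its token-based fitness.

    net maps each activity label to a pair (input_places, output_places).
    Every activity in the trace is assumed to exist in the net.
    """
    marking = Counter(initial)
    produced = sum(initial.values())  # the initial tokens count as produced
    consumed = 0
    missing = 0
    for activity in trace:
        inputs, outputs = net[activity]
        for place in inputs:
            if marking[place] == 0:
                missing += 1  # a token had to be added artificially
            else:
                marking[place] -= 1
            consumed += 1
        for place in outputs:
            marking[place] += 1
            produced += 1
    # Consume the final marking at the end of the replay.
    for place, count in final.items():
        available = min(count, marking[place])
        missing += count - available
        marking[place] -= available
        consumed += count
    remaining = sum(marking.values())
    return 0.5 * (1 - missing / consumed) + 0.5 * (1 - remaining / produced)


# Hypothetical sequential net: start -> a -> p1 -> b -> p2 -> c -> end
net = {
    "a": (["start"], ["p1"]),
    "b": (["p1"], ["p2"]),
    "c": (["p2"], ["end"]),
}
initial, final = {"start": 1}, {"end": 1}

fit_ok = token_replay_fitness(["a", "b", "c"], net, initial, final)
fit_skip = token_replay_fitness(["a", "c"], net, initial, final)
print(fit_ok, fit_skip)
```

Each event costs only a few counter updates, so the replay is linear in the length of the trace. Alignments instead search for an optimal matching between trace and model, which is considerably more expensive.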

Applying filtering

Filtering is a very important operation in Process Mining because it helps to restrict the log to the information that is really useful.

We have filtering at three different levels:

  • Event level (keeping only events in the log that satisfy a particular pattern)
  • Case level (keeping only the cases in the log whose events satisfy a particular pattern)
  • Variant level (e.g. keeping only the cases whose activities match a given variant)
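The three levels can be sketched on a Pandas dataframe as follows. This is a minimal illustration in plain Pandas, not the PM4Py filtering API; the sample log, the chosen activity "check" and the chosen variant are hypothetical, and the rows of each case are assumed to be already in time order.

```python
import pandas as pd

CASE, ACT = "case:concept:name", "concept:name"

# Hypothetical log with three cases, rows already ordered within each case.
df = pd.DataFrame({
    CASE: ["c1", "c1", "c1", "c2", "c2", "c3", "c3"],
    ACT: ["register", "check", "pay", "register", "pay", "register", "check"],
})

# Event level: keep only the events of the most frequent activity.
top_activity = df[ACT].value_counts().idxmax()
event_filtered = df[df[ACT] == top_activity]

# Case level: keep every event of the cases that contain a "check" event.
cases_with_check = df.loc[df[ACT] == "check", CASE].unique()
case_filtered = df[df[CASE].isin(cases_with_check)]

# Variant level: a variant is the sequence of activities of a case.
variants = df.groupby(CASE)[ACT].agg(",".join)
wanted_cases = variants[variants == "register,pay"].index
variant_filtered = df[df[CASE].isin(wanted_cases)]

print(variants.value_counts())
```

Each filter is a boolean mask over the dataframe and returns another dataframe, so filters can be chained without converting the log to a different structure.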

The speed comparison has been made between the Pandas dataframe implementation in PM4Py and the XLog implementation in ProM6.

The following table reports the time needed, in ProM6 and in PM4Py, to filter the log on its most frequent activity, producing another log as output:

The following table reports the time needed, in ProM6 and in PM4Py, to retrieve the variants from the log:
