Stream Management

In PM4Py, we offer an interface for stream management that is intended to be an entry point: it accepts events from potentially many sources and forwards them to potentially many targets.

Here, we offer an overview of the features provided by the interface.

Instantiation and Start of the Stream

The stream can be instantiated through the following instructions:

from pm4py.streaming.stream.stream import LiveEventStream
stream = LiveEventStream()

Then, it can be started through the command:

stream.start()

From that moment on, the stream accepts events.

Usage of a Listener Algorithm – the Event Printer example

We offer an example implementation of a listener algorithm that simply prints the events received from the stream. To use it and register it on the stream, the following commands can be provided:

from pm4py.streaming.algo.event_printer import EventPrinter
event_printer = EventPrinter()
stream.register(event_printer)

Creation of a Listener Algorithm – Extending the Interface

The interface is found in pm4py.streaming.algo.interface, and the name of the class is StreamingAlgorithm.

Essentially, a single method (receive) needs to be overridden. As an example, here is the code of the EventPrinter:

from pm4py.streaming.algo import interface

class EventPrinter(interface.StreamingAlgorithm):

    def receive(self, event):
        print(event)
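
As a further illustration, here is a minimal sketch (not part of PM4Py) of a listener that counts the events it receives:

from pm4py.streaming.algo import interface

class EventCounter(interface.StreamingAlgorithm):
    # hypothetical listener counting the events received from the stream

    def __init__(self):
        # depending on the PM4Py version, calling the parent constructor may be required
        super(EventCounter, self).__init__()
        self.count = 0

    def receive(self, event):
        self.count += 1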

Appending an Event to the Stream – and seeing the reaction of the EventPrinter

For this purpose, the following code can be used; it appends an Event with two attributes to the stream:

from pm4py.objects.log.log import Event
stream.append(Event({"case:concept:name": "1", "concept:name": "A"}))

The EventPrinter that was registered on the stream then reacts as expected, printing the event.

Stopping the Stream

When the stream is no longer needed, the following code can be used:

stream.stop()
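
Putting the previous steps together, a complete usage sketch looks as follows:

from pm4py.streaming.stream.stream import LiveEventStream
from pm4py.streaming.algo.event_printer import EventPrinter
from pm4py.objects.log.log import Event

# instantiate the stream and register the listener
stream = LiveEventStream()
stream.register(EventPrinter())
# start accepting events
stream.start()
# append an event; the EventPrinter prints it
stream.append(Event({"case:concept:name": "1", "concept:name": "A"}))
# stop the stream when it is no longer needed
stream.stop()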

Task Mining – BPM 2019

In PM4Py, we offer an implementation of the task mining approach presented in the AI4BPM workshop of BPM 2019.

The title of the paper is:

Process Mining from Low-level User Actions

by Aviv Yehezkel, Yaron Bialy, Ariel Smutko, Eran Roseberg

A summary of the approach, as a sequence of steps:

  • Stream grouping: events that are temporally near, or connected by a strong relationship, are grouped.
  • Task Generalization/Entity Recognition: specific entities in titles/descriptions are replaced with a generic entity.
  • Sequence Mining (frequent sequence mining with the PrefixSpan algorithm; see the sketch after this list).
  • Sequence Clustering.
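
As a side note on the sequence mining step, the prefixspan package (which appears among the PM4Py dependencies listed later on this page) can also be tried standalone. A minimal sketch on a toy database of encoded steps:

from prefixspan import PrefixSpan

# toy database: each inner list is a sequence of encoded steps
db = [[0, 1, 2], [0, 1, 1], [1, 2], [0, 1, 2, 2]]
ps = PrefixSpan(db)
# frequent sequential patterns appearing in at least 2 of the sequences
print(ps.frequent(2))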

Here, we present an example of applying task mining to a log recorded by the “clicks recorder” (described below).

First, the CSV log file containing the clicks is imported (with a specific encoding, ignoring possible malformed lines):

import pandas as pd
df = pd.read_csv(r"clicks.csv", sep=";", error_bad_lines=False, encoding="ISO-8859-1")

Then, it is converted to an event stream (i.e. a list of low-level events):

from pm4py.objects.conversion.log import factory as log_conv_factory
stream = log_conv_factory.apply(df, variant=log_conv_factory.TO_EVENT_STREAM)

Then, the task mining approach is applied. The output is a list of processes recognized by the task miner, each containing a list of recognized sequences (lists of steps):

from pm4py.algo.task_mining import factory as task_mining_factory
tasks = task_mining_factory.apply(stream)

The output can then be stored in a file for nicer visualization:

import json

with open("dump.json", "w") as f:
    f.write(json.dumps(tasks, indent=2))

A representation of the output is contained in the following snippet. Since the output is a list of processes, and each process is a list of recognized sequences (lists of steps), the output starts with two [ characters (list openings):

[
  [
    {
      "label": [
        "pycharm64.exe SunAwtFrame (830.7, 514.5) pm4py-source [C:\\Users\\aless\\pm4py-source] - ...\\",
        "pycharm64.exe SunAwtFrame (75.0, 27.4) ...\\ - PyCharm"
      ],
      "score": 12.899999856948853,
      "no_occurrences": 3,
      "events": [
        [
          {
            "pid": 9124,
            "process_name": "pycharm64.exe",
            "process_exe": "C:\\Program Files\\JetBrains\\PyCharm Community Edition 2019.2\\bin\\pycharm64.exe",
            "classname": "SunAwtFrame",
            "window_id": 918756,
            "window_name": "pm4py-source [C:\\Users\\aless\\pm4py-source] - ...\\pm4py\\algo\\stream\\tasks\\versions\\equiv_spatial_grouping.py - PyCharm",
            "window_position": "(-7, -7)",
            "window_dimension": "(1208, 768)",
            "event_position": "(800, 521)",
            "event_position_rel": "(807, 528)",
            "username": "aless",
            "computername": "DESKTOP-14M5HJ1",
            "current_timestamp": 1567937944.7,
            "last_screenshot": "screenshots\\Screenshot_1567937942945.png",
            "@@label_index": 12,
            "@@label": "pycharm64.exe SunAwtFrame (830.7, 514.5) pm4py-source [C:\\Users\\aless\\pm4py-source] - ...\\"
          },...


Clicks Recorder

An open-source Clicks Recorder for Windows >= 7, runnable in a Python 2.7 environment, is freely available on GitHub at:

https://github.com/Javert899/clicks-recorder

For the installation, the pywin32 package should be installed from the URL:

https://github.com/mhammond/pywin32/releases

Then, the requirements contained in the requirements.txt file can be installed through (NB: the recorder is only compatible with Python 2.7 and Windows 🙂):

pip install -U -r requirements.txt

The clicks recorder can then be run through:

python clicks.py

Output: for each click, a screenshot is saved into the screenshots folder, along with an entry in a CSV file. The name of the CSV file contains the name of the workstation and the name of the user (useful when a centralized repository is used).

An example of such a CSV file can be found at http://www.alessandroberti.it/clicks_aless_DESKTOP-14M5HJ1.csv

Columns of the CSV file:

  • pid => the identifier of the process managing the window where the click happens
  • process_name => the name of the process having such PID
  • process_exe => the executable path of the process having such PID
  • classname => the name of the class (inside the process) that manages the window where the click happens
  • window_id => ID of the window where the click happens
  • window_name => name of the window
  • window_position => position of the window
  • window_dimension => dimensions (width, height) of the window
  • event_position => absolute coordinates of the current click
  • event_position_rel => coordinates of the current click relative to the position of the window
  • username => username of the user that is using the computer
  • computername => name of the computer
  • current_timestamp => timestamp of the current event
  • last_screenshot => screenshot associated to the event
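
As a small sketch (the file name and separator are assumptions based on the example above; adapt them to your recording), the CSV can be loaded with pandas and the UNIX timestamps converted:

import pandas as pd

# hypothetical file name; the recorder names the CSV after user and workstation
df = pd.read_csv("clicks_aless_DESKTOP-14M5HJ1.csv", sep=";", encoding="ISO-8859-1")
# current_timestamp is a UNIX timestamp expressed in seconds
df["current_timestamp"] = pd.to_datetime(df["current_timestamp"], unit="s")
print(df[["process_name", "window_name", "current_timestamp"]].head())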

Task Mining

With the term “Task Mining”, we refer to an approach complementary to Process Mining, where we want to infer useful information from low-level event data describing the individual steps performed by a user, for example while using their workstation.

With Task Mining, the final aim is to obtain a list of steps that can be subjected to automation (RPA, Robotic Process Automation), rather than a complete end-to-end process schema.

We will introduce:

  • A click-recorder package that can be easily installed in a Windows environment
  • An open-source implementation of the task mining approach presented at the BPM 2019 conference

PM4Py Dependencies Licensing

PM4Py, in its current state, is supported by open-source libraries released under MIT/BSD/Apache 2.0 and other permissive licenses. The complete list is found below:

Package Link License
MarkupSafe https://pypi.org/project/MarkupSafe/ BSD 3 cl.
backcall https://pypi.org/project/backcall/ BSD
ciso8601 https://pypi.org/project/ciso8601/ MIT
colorama https://pypi.org/project/colorama/ BSD
cycler https://pypi.org/project/Cycler/ BSD
decorator https://pypi.org/project/decorator/ BSD
graphviz https://pypi.org/project/graphviz/ MIT
ipython https://pypi.org/project/ipython/ BSD
ipython-genutils https://pypi.org/project/ipython_genutils/ BSD
jedi https://pypi.org/project/jedi/ MIT
jinja2 https://pypi.org/project/Jinja2/ BSD
joblib https://pypi.org/project/joblib/ BSD
kiwisolver https://pypi.org/project/kiwisolver/ BSD
lxml https://pypi.org/project/lxml/ BSD
matplotlib https://pypi.org/project/matplotlib/ Python Software Foundation License (PSF, BSD-style)
networkx https://pypi.org/project/networkx/ BSD
numpy https://pypi.org/project/numpy/ BSD
ortools https://pypi.org/project/ortools/ Apache 2.0
pandas https://pypi.org/project/pandas/ BSD
parso https://pypi.org/project/parso/ MIT
pickleshare https://pypi.org/project/pickleshare/ MIT
prompt-toolkit https://pypi.org/project/prompt_toolkit/ BSD 3 cl.
protobuf https://pypi.org/project/protobuf/ BSD 3 cl.
pulp https://github.com/coin-or/pulp/blob/master/LICENSE Custom license (free, MIT style)
pyarrow https://pypi.org/project/pyarrow/ Apache 2.0
pydotplus https://pypi.org/project/pydotplus/ MIT
pygments https://pypi.org/project/Pygments/ BSD
pyparsing https://pypi.org/project/pyparsing/ MIT
python-dateutil https://pypi.org/project/python-dateutil/ Apache, BSD (dual license)
pytz https://pypi.org/project/pytz/ MIT
pyvis https://pypi.org/project/pyvis/ BSD
scikit-learn https://pypi.org/project/scikit-learn/ OSI (new BSD)
scipy https://pypi.org/project/scipy/ BSD
setuptools https://pypi.org/project/setuptools/ MIT
six https://pypi.org/project/six/ MIT
traitlets https://pypi.org/project/traitlets/ BSD
wcwidth https://pypi.org/project/wcwidth/ MIT
prefixspan https://pypi.org/project/prefixspan/ MIT
docopt https://pypi.org/project/docopt/ MIT
extratools https://pypi.org/project/extratools/ MIT
sortedcontainers https://pypi.org/project/sortedcontainers/ Apache 2.0
toolz https://pypi.org/project/toolz/ BSD


Incremental Calculation of Cycle and Lead Time

Two important KPIs for process executions are:

  • The Lead Time: the overall time from the start to the end of the instance, regardless of whether it was actively worked.
  • The Cycle Time: the overall time from the start to the end of the instance, considering only the periods in which it was actively worked.

For these concepts, it is important to consider only business hours (so, excluding nights and weekends): in those periods the machinery is idle and the workforce is at home, so the instance cannot progress, and the time spent there is not recoverable.

With ‘interval’ event logs (that have a start and an end timestamp), it is possible to calculate the lead time and the cycle time incrementally (event per event). The lead time and the cycle time reported on the last event of a case are the ones of the whole process execution. With this, it is easy to understand which activities of the process have caused a bottleneck (e.g. where the lead time increases significantly more than the cycle time).

The algorithm implemented in PM4Py starts by sorting each case by the start timestamp (so, activities started earlier appear earlier in the log), and is able to calculate the lead and cycle time in all situations, including complex ones (e.g. overlapping activities).

In the following, we aim to insert the following attributes into the events of a log:

  • @@approx_bh_partial_cycle_time => incremental cycle time associated to the event (the cycle time of the last event is the cycle time of the instance)
  • @@approx_bh_partial_lead_time => incremental lead time associated to the event
  • @@approx_bh_overall_wasted_time => difference between the partial lead time and the partial cycle time values
  • @@approx_bh_this_wasted_time => wasted time ONLY with regards to the activity described by the ‘interval’ event
  • @@approx_bh_ratio_cycle_lead_time => measures the incremental Flow Rate (between 0 and 1).

The method that calculates the lead and the cycle time accepts the following optional parameters:

  • worktiming: the work timing (e.g. [7, 17])
  • weekends: the specification of the weekends (e.g. [6, 7])

It can be applied with the following line of code:

from pm4py.objects.log.util import interval_lifecycle

enriched_log = interval_lifecycle.assign_lead_cycle_time(log)

With this, an enriched log is obtained, containing for each event the corresponding lead/cycle time attributes.
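
As a sketch (assuming the worktiming and weekends parameters are accepted under exactly these keys; check the method signature in your PM4Py version), custom business hours can be passed, and the resulting attributes can be inspected on the last event of a case:

from pm4py.objects.log.util import interval_lifecycle

# hypothetical parameter keys for custom shifts (working hours 7-17, weekend days 6 and 7)
enriched_log = interval_lifecycle.assign_lead_cycle_time(
    log, parameters={"worktiming": [7, 17], "weekends": [6, 7]})

# the KPIs of a whole case are reported on its last event
last_event = enriched_log[0][-1]
print(last_event["@@approx_bh_partial_lead_time"])
print(last_event["@@approx_bh_partial_cycle_time"])
print(last_event["@@approx_bh_ratio_cycle_lead_time"])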

Filtering on activities duration

Having a log enriched with interval and duration information, it is possible to identify the process instances where the actual cycle time (COMPLETE – START) of an activity falls in a given range.

To do so, the numeric attribute filter can be used. Suppose we start with the BPI Challenge 2012 log (for speed, only the first 50 cases are imported):

from pm4py.objects.log.importer.xes import factory as xes_importer

log = xes_importer.apply("bpic2012.xes", variant="nonstandard", parameters={"max_no_traces_to_import": 50})

and use the log utility to transform it into an “interval” event log, enriched with the duration information:

from pm4py.objects.log.util import interval_lifecycle

enriched_log = interval_lifecycle.assign_lead_cycle_time(log)

Then, suppose we want to filter the cases where the activity W_Nabellen offertes (from START to COMPLETE) had a duration between 300.0 and 1000.0 seconds. The numeric attribute filter can be used:

from copy import deepcopy

from pm4py.algo.filtering.log.attributes import attributes_filter
from pm4py.util import constants

With the following specification (here, the numeric attribute filter is applied to the @@duration attribute, while an additional filter requires the concept:name attribute to be equal to W_Nabellen offertes):

filtered_log = attributes_filter.apply_numeric(deepcopy(enriched_log), 300.0, 1000.0,
                                               parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "@@duration",
                                                           "stream_filter_key1": "concept:name",
                                                           "stream_filter_value1": "W_Nabellen offertes"})

Suppose we want to impose the further restriction that the activity W_Nabellen offertes is performed by the resource 10913; then the following code can be used:

filtered_log2 = attributes_filter.apply_numeric(deepcopy(enriched_log), 300.0, 1000.0,
                                                parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "@@duration",
                                                            "stream_filter_key1": "concept:name",
                                                            "stream_filter_value1": "W_Nabellen offertes",
                                                            "stream_filter_key2": "org:resource",
                                                            "stream_filter_value2": "10913"})


Business Hours

Given an interval event log (an EventLog object where each event is characterised by two timestamps: a start timestamp, usually contained in the start_timestamp attribute, and a completion timestamp, usually contained in the time:timestamp attribute), the “duration” of an event is the difference between the completion timestamp and the start timestamp. This difference may be inflated by nights (when the activity is not actively worked), weekends (when the workers may not be at the workplace) and other kinds of pauses. PM4Py provides a way to consider only the time in which the activity could actually be worked, excluding time outside of the working hours and weekends.

Given a start and an end timestamp (expressed as UNIX timestamps), the business hours calculation method can be called as follows:

from pm4py.util.business_hours import BusinessHours
from datetime import datetime

st = datetime.fromtimestamp(100000000)
et = datetime.fromtimestamp(200000000)
bh_object = BusinessHours(st, et)
worked_time = bh_object.getseconds()
print(worked_time)

This yields 29736000 seconds for the specific example.

To provide specific shifts and weekends (for example, always short weeks with 4 working days 🙂 and working hours from 10 to 16), the following code can be used:

bh_object = BusinessHours(st, et, worktiming=[10, 16], weekends=[5, 6, 7])
worked_time = bh_object.getseconds()
print(worked_time)
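
For instance, the same calculation can be applied to the two timestamps of a single interval event. A sketch with hypothetical UNIX timestamps:

from datetime import datetime

from pm4py.util.business_hours import BusinessHours

# hypothetical start/completion UNIX timestamps of an interval event
start_ts = 1567937944.7
complete_ts = 1568024344.7
st = datetime.fromtimestamp(start_ts)
et = datetime.fromtimestamp(complete_ts)
# duration counting only the default working hours and weekends
print(BusinessHours(st, et).getseconds())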

The business hours duration method is called automatically in the following parts of PM4Py:

  • Conversion of a lifecycle log to an interval log, when business hours are explicitly required (the worktiming and weekends parameters can optionally be provided if they differ from the standard values)
  • Calculation of process cycle and lead time (the worktiming and weekends parameters can optionally be provided if they differ from the standard values)

Calculation of the time passed from the previous activity

This calculation is simply based on the direct succession of events in a case, and does not take the concurrency between activities into account.

A utility has been provided in PM4Py to insert in each event of the log the time passed since the previous event (attribute @@passed_time_from_previous) and the approximated business-hours time since the previous event (attribute @@approx_bh_passed_time_from_previous).

In the following example, we apply the utility. We start by importing the receipt.xes log:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
log = xes_importer.apply(os.path.join("tests", "input_data", "receipt.xes"))

Then, the time from the previous event is inserted into the log (inside the attributes mentioned above):

from pm4py.objects.log.util import time_from_previous
log = time_from_previous.insert_time_from_previous(log)
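
As a quick check (a sketch), the inserted attributes can be read from any non-first event of a case:

# read the inserted attributes on the second event of the first case
second_event = log[0][1]
print(second_event["@@passed_time_from_previous"])
print(second_event["@@approx_bh_passed_time_from_previous"])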

Then, the log can be exported:

from pm4py.objects.log.exporter.xes import factory as xes_exporter
xes_exporter.apply(log, "receipt.xes")

If the exported log is opened, it can be seen that the insertion was successful.


Benchmarks

Having benchmarks, whether negative or positive, is important for us. On this page, we collect some of the benchmarks that have been done on PM4Py in comparison to other Process Mining tools.

Importing a CSV and calculating the frequency DFG

The most widely used format for data extracted from databases is the CSV format. In this section, we measure the time needed to load a CSV file and to calculate the frequency Directly-Follows Graph (DFG) on it.

The script that has been used in PM4Py to measure the CSV import time and to calculate the DFG is the following (tests were done on an i7-7550U with 16 GB of DDR4 RAM):

from pm4py.objects.log.adapters.pandas import csv_import_adapter
from pm4py.algo.discovery.dfg.adapters.pandas import df_statistics
import time

aa = time.time()
# import the dataframe without converting the timestamp columns
df = csv_import_adapter.import_dataframe_from_path_wo_timeconversion("C:\\csv_logs\\Billing.csv")
# optional: convert the timestamp columns and sort by case ID and timestamp
#df = csv_import_adapter.convert_timestamp_columns_in_df(df, timest_columns=["time:timestamp"], timest_format="%Y-%m-%d %H:%M:%S")
#df = df.sort_values(["case:concept:name", "time:timestamp"])
bb = time.time()
# calculate the frequency DFG directly on the dataframe
dfg = df_statistics.get_dfg_graph(df, measure="frequency", sort_caseid_required=False)
cc = time.time()
print(bb - aa)  # import time
print(cc - bb)  # DFG calculation time

 

The results were compared to those of the CSV importer included in ProM 6.

When timestamp columns are converted and a sort operation is done, the Pandas CSV importer performs in some cases better than, and in some cases on par with, the ProM6 CSV importer. When no sort is applied, the Pandas import is much faster.

When calculating the DFG, a comparison was made with the analogous plug-in in ProM 6, found inside the Inductive Miner package (tests were done on an i7-7550U with 16 GB of DDR4 RAM). The Pandas-based approach in PM4Py performs very well in comparison to the ProM 6 implementation.

Comparing PM4Py to a leading commercial software in this context, better performance was also obtained.

Importing a XES file

ProM offers a wide choice of plug-ins able to import XES files. Choosing one to refer to is difficult; since RAM is not a problem, the Naive importer has been chosen.

PM4Py, on the other hand, offers two different XES importers:

  • A standard, certified importer that relies on the LXML library and is able to handle any sort of XML file.
  • A non-standard, non-certified importer that is a “line parser” and is able to parse only pretty-printed XML files.
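
A minimal sketch to time the two importers on the same file (assuming the variant names "iterparse" for the standard importer and "nonstandard" for the line parser; check the factory of your PM4Py version):

import time
from pm4py.objects.log.importer.xes import factory as xes_importer

# assumed variant names for the standard and the line-parser importer
for variant in ["iterparse", "nonstandard"]:
    t0 = time.time()
    log = xes_importer.apply("receipt.xes", variant=variant)
    print(variant, time.time() - t0)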

In the comparison, PM4Py could not handle XES files at the same speed as ProM. The non-standard importer performs better, but fails on both BPI Challenge 2017 logs.

Applying Process Discovery techniques

In PM4Py, we offer two process discovery techniques, namely the Alpha Miner and an implementation of the Inductive Miner Directly-Follows.

The most time-consuming part of applying the two algorithms to big logs consists in retrieving the basic structures (e.g. the causal relations for the Alpha Miner, the DFG for the Inductive Miner).

A direct comparison can be made on the execution speed of the Alpha Miner on some big logs; here, PM4Py loses the speed competition to the ProM implementation.
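
As a sketch of how such a measurement can be taken (assuming the PM4Py 1.x factory-style API and an already imported log):

import time
from pm4py.algo.discovery.alpha import factory as alpha_miner

# time the discovery of a Petri net with the Alpha Miner
t0 = time.time()
net, initial_marking, final_marking = alpha_miner.apply(log)
print(time.time() - t0)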

Applying Conformance Checking techniques

In PM4Py, two different conformance checking techniques are provided:

  • Token-based replay
  • Alignments

The ProM6 Process Mining framework offers the alignments technique (through the “Conformance Analysis” plug-in).

Token-based replay is a very fast technique and outperforms the competing implementation in ProM6. The reported times were obtained by replaying each log on a model discovered by the IMDFA implementation in PM4Py.
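
A sketch of this kind of measurement (assuming the PM4Py 1.x factory-style modules for the Inductive Miner and token-based replay, and an already imported log):

import time
from pm4py.algo.discovery.inductive import factory as inductive_miner
from pm4py.algo.conformance.tokenreplay import factory as token_replay

# discover a model, then time the replay of the log on it
net, initial_marking, final_marking = inductive_miner.apply(log)
t0 = time.time()
replay_results = token_replay.apply(log, net, initial_marking, final_marking)
print(time.time() - t0)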

Unfortunately, alignments in PM4Py are not (yet) on pace with the implementation provided in ProM6.

Applying filtering

Filtering is a very important operation in Process Mining because it helps to restrict the log to the information that is really useful.

We have filtering at three different levels:

  • Event level (keeping only events in the log that satisfy a particular pattern)
  • Case level (keeping only the cases in the log whose events satisfy a particular pattern)
  • Variant level (e.g. keeping only cases where we have a set of activities in the log)

The speed comparison has been made between the Pandas dataframe implementation in PM4Py and the XLog implementation in ProM6.

One measured task is filtering the log on its most frequent activity, obtaining another log as output, in both ProM6 and PM4Py.
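
A sketch of the activity filtering on the event log object (the helper names are from the PM4Py 1.x attributes_filter module shown earlier; treat them as assumptions for your version):

from pm4py.algo.filtering.log.attributes import attributes_filter

# find the most frequent activity and keep only the events performing it
activities = attributes_filter.get_attribute_values(log, "concept:name")
most_frequent = max(activities, key=activities.get)
filtered_log = attributes_filter.apply_events(log, [most_frequent])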

Another measured task is retrieving the variants from the log, in both ProM6 and PM4Py.
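
A sketch of the variants retrieval (assuming the PM4Py 1.x variants_filter module):

from pm4py.algo.filtering.log.variants import variants_filter

# dictionary associating each variant with the list of cases following it
variants = variants_filter.get_variants(log)
print(len(variants))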
