Trace Clustering

BETA feature: PM4Py offers support for clustering on the ‘decisionTreeIntegration’ branch.

Clustering is an important functionality that splits the traces of the log into groups having similar behavior.

The ideas behind the implementation are described in the following scientific paper:

de Leoni, Massimiliano, Wil MP van der Aalst, and Marcus Dees. “A general process mining framework for correlating, predicting and clustering dynamic behavior based on event logs.”

In particular, see Section 3 (extracting relevant process characteristics) and Section 6 (trace clustering).

The algorithm implemented in PM4Py is basic:

  • A one-hot encoding of the activities of the single events is obtained.
  • A PCA is performed to reduce the number of components that are considered by the clustering algorithm.
  • The DBSCAN clustering algorithm is applied in order to split the traces into groups.
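The three steps above can be sketched outside PM4Py with scikit-learn. This is a minimal illustration, not PM4Py's actual implementation: the helper names (`one_hot_traces`, `cluster_traces`), the toy traces, and the parameter values (`n_components`, `eps`, `min_samples`) are all assumptions chosen for the example, and each trace is simplified to a 0/1 vector of the activities it contains.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA


def one_hot_traces(traces):
    """Encode each trace as a 0/1 vector marking which activities occur in it."""
    activities = sorted({act for trace in traces for act in trace})
    column = {act: i for i, act in enumerate(activities)}
    matrix = np.zeros((len(traces), len(activities)))
    for row, trace in enumerate(traces):
        for act in trace:
            matrix[row, column[act]] = 1.0
    return matrix, activities


def cluster_traces(traces, n_components=2, eps=0.3, min_samples=2):
    """Group traces via one-hot encoding, PCA, and DBSCAN (illustrative values)."""
    matrix, _ = one_hot_traces(traces)
    # PCA cannot keep more components than there are samples or features.
    n_components = min(n_components, *matrix.shape)
    reduced = PCA(n_components=n_components).fit_transform(matrix)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(reduced)
    # Traces sharing a DBSCAN label form one sublist; the noise label -1
    # becomes its own sublist, so no trace is dropped.
    groups = {}
    for trace, label in zip(traces, labels):
        groups.setdefault(int(label), []).append(trace)
    return list(groups.values())


# Hypothetical toy traces: two behaviors, three cases each.
toy_traces = [["a", "b", "c"]] * 3 + [["a", "d", "e"]] * 3
toy_clusters = cluster_traces(toy_traces)
print([len(group) for group in toy_clusters])
```

Because every trace ends up in exactly one sublist, the sublists together contain all the input traces, mirroring the property that the union of the sublogs is the original log.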

Given a log, the clustering operation returns a list of logs (sublogs) whose union is the original log. An example follows.

First, load a log:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer

log = xes_importer.apply(os.path.join("tests", "input_data", "receipt.xes"))

Then perform the clustering operation:

from pm4py.algo.other.clustering import factory as clusterer

clusters = clusterer.apply(log)

The following code prints each cluster:

for sublog in clusters:
    print(sublog)

Clustering based on case durations

Besides the classic clustering, it can be useful to find clusters of cases with similar duration. This is possible with the ‘duration’ variant of the clustering algorithm. As an example, first load a log:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer

log = xes_importer.apply(os.path.join("tests", "input_data", "receipt.xes"))

Then perform the clustering operation; in this variant, the Birch algorithm is applied to the list of case durations:

from pm4py.algo.other.clustering import factory as clusterer

clusters = clusterer.apply(log, variant="duration")

In this example, two clusters are found, with 1416 and 18 cases respectively (normal-duration cases and extraordinary-duration cases):

print([len(x) for x in clusters])
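The idea of applying Birch to a list of case durations can be illustrated on its own with scikit-learn. The sketch below is not PM4Py's code: the helper `cluster_by_duration`, the choice of `n_clusters=2`, and the toy durations are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import Birch


def cluster_by_duration(durations, n_clusters=2):
    """Split case indices into groups of similar duration using Birch."""
    # Birch expects a 2-D array, so the durations become a single column.
    values = np.asarray(durations, dtype=float).reshape(-1, 1)
    labels = Birch(n_clusters=n_clusters).fit_predict(values)
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(int(label), []).append(index)
    return list(groups.values())


# Hypothetical case durations in seconds: four ordinary cases, two very long ones.
toy_durations = [100.0, 110.0, 120.0, 130.0, 10000.0, 10500.0]
duration_groups = cluster_by_duration(toy_durations)
print(sorted(len(group) for group in duration_groups))
```

Clustering on a single numeric feature like this separates the few very long cases from the bulk of ordinary ones, which is the same kind of split as the 1416/18 result reported for the receipt log.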

To print the durations of the extraordinary-duration cases, the following commands can be used:

from pm4py.statistics.traces.log.case_statistics import get_all_casedurations

durations = get_all_casedurations(clusters[1])
print(durations)

To visualize the features that lead to normal or extraordinary case durations, the following utility (from the concept drift detection package) can be used:

from pm4py.algo.other.conceptdrift.utils import get_representation

clf, feature_names, classes = get_representation.get_decision_tree(clusters[0], clusters[1])

This yields a decision tree representation of the features that separate the two clusters.

[Figure: decision tree representation]
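To give a rough idea of what such a representation conveys, the following sketch fits a scikit-learn decision tree on two small groups of cases and prints its splits as text. It does not use the PM4Py utility: the feature names, the toy feature values, and the `max_depth` setting are hypothetical, and whether PM4Py builds its tree the same way is an assumption.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical per-case features: [number of events, contains a "rework" activity].
toy_feature_names = ["num_events", "has_rework"]
normal_cases = [[5, 0], [6, 0], [5, 0], [7, 0]]
long_cases = [[12, 1], [15, 1], [11, 1]]

# Class 0 = normal-duration cluster, class 1 = extraordinary-duration cluster.
toy_features = normal_cases + long_cases
toy_targets = [0] * len(normal_cases) + [1] * len(long_cases)

toy_clf = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_clf.fit(toy_features, toy_targets)

# Text rendering of the learned splits.
print(export_text(toy_clf, feature_names=toy_feature_names))
```

Each split in the printed tree names a feature and a threshold, so reading the tree from the root shows which characteristics distinguish the two groups.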