Trace Clustering

BETA feature: PM4Py supports trace clustering on the ‘decisionTreeIntegration’ branch.

Clustering splits the traces of a log into groups with similar behavior. The algorithm implemented in PM4Py is basic:

  • A one-hot encoding of the activities of the single events is obtained.
  • A PCA is performed to reduce the number of components considered by the clustering algorithm.
  • The DBSCAN clustering algorithm is applied in order to split the traces into groups.
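
The three steps above can be sketched outside PM4Py with scikit-learn. This is only an illustration of the idea, not PM4Py's actual implementation: the toy traces, the per-trace encoding (1 if an activity occurs in the trace), and the values of n_components, eps and min_samples are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

# Hypothetical toy log: each trace is the list of activity names of its events.
traces = [
    ["register", "check", "approve"],
    ["register", "check", "approve"],
    ["register", "check", "check", "approve"],
    ["register", "reject"],
    ["register", "reject"],
]

# Step 1: one-hot encoding, one column per activity (assumed encoding:
# 1.0 if the activity occurs in the trace, 0.0 otherwise).
activities = sorted({act for trace in traces for act in trace})
X = np.array(
    [[1.0 if act in trace else 0.0 for act in activities] for trace in traces]
)

# Step 2: PCA reduces the number of components passed to the clusterer.
reduced = PCA(n_components=2).fit_transform(X)

# Step 3: DBSCAN assigns a cluster label to every trace.
# With min_samples=1 no trace is marked as noise (label -1).
labels = DBSCAN(eps=0.3, min_samples=1).fit_predict(reduced)

# Group the traces by label to obtain the sublogs.
sublogs = {}
for trace, label in zip(traces, labels):
    sublogs.setdefault(int(label), []).append(trace)

for label, sublog in sorted(sublogs.items()):
    print(label, len(sublog))
```

On this toy input the traces that contain "approve" and the traces that contain "reject" end up in separate groups, and every trace belongs to exactly one group.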

Given a log, the clustering operation returns a list of logs (sublogs) whose union is the original log. Let’s provide an example.

First, a log is loaded into the system:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer

log = xes_importer.apply(os.path.join("tests", "input_data", "receipt.xes"))

Then the clustering operation can be performed:

from pm4py.algo.other.clustering import factory as clusterer

clusters = clusterer.apply(log)

The following code can be used to print the individual clusters:

for sublog in clusters: