Concept Drift Detection

BETA feature: PM4Py supports concept drift detection on the ‘decisionTreeIntegration’ branch.

Concept drift detection identifies the points in time at which a process changes. The detection can take several aspects into account, such as:

  • Changes over time in the values of some attributes
  • Changes in the directly-follows relations between activities
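To make the second aspect concrete, the sketch below counts directly-follows relations in two periods and compares them. It is only an illustration in plain Python: the traces, the activity names and the helper `directly_follows` are hypothetical and are not part of PM4Py.

```python
from collections import Counter

def directly_follows(traces):
    """Count how often activity a is directly followed by activity b."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

# Hypothetical traces (lists of activity names) from two periods.
early = [["register", "check", "approve"], ["register", "check", "reject"]]
late = [["register", "approve"], ["register", "approve"]]

df_early = directly_follows(early)
df_late = directly_follows(late)

# Relations that exist in only one period hint at a control-flow change.
appeared = set(df_late) - set(df_early)
disappeared = set(df_early) - set(df_late)
print("appeared:", sorted(appeared))
print("disappeared:", sorted(disappeared))
```

Here the "check" activity vanishes in the later period, so every relation involving it disappears and a new direct path from "register" to "approve" appears.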

An overview of techniques for detecting concept drift from the control-flow perspective is given in the following paper:

Bose, RP Jagadeesh Chandra, et al. “Handling concept drift in process mining.” International Conference on Advanced Information Systems Engineering. Springer, Berlin, Heidelberg, 2011.

PM4Py implements a new offline approach to concept drift detection that can use perspectives other than the classic control-flow one, covering cases where the change happens in the data perspective. The approach works as follows:

  • First, each case is automatically represented as a set of features (the same procedure that feeds the decision trees)
  • Next, a Principal Component Analysis is performed to reduce the number of dimensions
  • The start and end timestamps of the trace are added as features
  • The Birch clustering algorithm is applied, trying several numbers of clusters
  • The clusters produced by Birch are analyzed to determine whether they are clearly separated in time
  • If multiple options are possible, the one that maximizes the Silhouette score is preferred
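The steps above can be sketched with scikit-learn. This is not the PM4Py implementation, only a minimal illustration on synthetic data. The feature matrix, the rescaling of timestamps, the Birch `threshold` value, the candidate cluster counts and the helper `separated_in_time` (which treats clusters as separated when their time intervals do not overlap) are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in for per-case features: 40 cases before a change, 40 after.
n = 40
features = np.vstack([rng.normal(0.0, 0.1, (n, 5)), rng.normal(5.0, 0.1, (n, 5))])
# Start and end timestamps (arbitrary units); the second group occurs later.
start = np.concatenate([rng.uniform(0, 100, n), rng.uniform(200, 300, n)])
end = start + rng.uniform(1, 10, 2 * n)

# 1) Reduce the (standardized) feature matrix with PCA.
reduced = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(features))

# 2) Append start and end timestamps, rescaled to [0, 1].
t0, t1 = start.min(), end.max()
X = np.column_stack([reduced, (start - t0) / (t1 - t0), (end - t0) / (t1 - t0)])

def separated_in_time(labels, start, end):
    """True if the clusters' [min start, max end] intervals do not overlap."""
    spans = sorted((start[labels == c].min(), end[labels == c].max())
                   for c in np.unique(labels))
    return all(prev_end < next_start
               for (_, prev_end), (next_start, _) in zip(spans, spans[1:]))

# 3) Try several cluster counts; among the time-separated clusterings,
#    keep the one with the highest Silhouette score.
best = None
for k in (2, 3, 4):
    labels = Birch(n_clusters=k, threshold=0.1).fit_predict(X)
    if len(np.unique(labels)) < 2 or not separated_in_time(labels, start, end):
        continue
    score = silhouette_score(X, labels)
    if best is None or score > best[0]:
        best = (score, k, labels)

drift_found = best is not None
if drift_found:
    print("drift found, number of clusters:", best[1])
```

Rescaling the timestamps keeps them on a scale comparable to the PCA components, so that neither dominates the distances Birch works with. Other scalings are equally plausible.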

The following example shows how to detect concept drift. First, load a log:

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
log_path = os.path.join("tests", "input_data", "receipt.xes")
log = xes_importer.apply(log_path)

Then, apply the concept drift detection method:

from pm4py.algo.other.conceptdrift import factory as concept_drift_factory
drift_found, logs_list, endpoints, change_date_repr = concept_drift_factory.apply(log)

The algorithm returns several outputs:

  • drift_found: indicates whether a drift has actually been found
  • logs_list: the sublogs, belonging to different timeframes, that differ from each other
  • endpoints: interesting time points of the concept drift between two successive logs (e.g. the third quartile of the log at index i and the first quartile of the log at index i+1)
  • change_date_repr: a single date estimating when the concept drift actually happened between two successive logs
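The quartile-based endpoints mentioned above can be illustrated with the standard library. The sketch below is an assumption about what such endpoints might look like, not PM4Py's actual computation. The helper `quartile_endpoints`, the use of one timestamp per case and the default quantile method of `statistics.quantiles` are choices made for this example only.

```python
from datetime import datetime, timezone
from statistics import quantiles

def quartile_endpoints(times_before, times_after):
    """Return (Q3 of the earlier sublog, Q1 of the later sublog) as datetimes."""
    before = [t.timestamp() for t in times_before]
    after = [t.timestamp() for t in times_after]
    q3_before = quantiles(before, n=4)[2]  # third quartile of the earlier sublog
    q1_after = quantiles(after, n=4)[0]    # first quartile of the later sublog
    return (datetime.fromtimestamp(q3_before, tz=timezone.utc),
            datetime.fromtimestamp(q1_after, tz=timezone.utc))

# Hypothetical case timestamps: one per day in early January and early February.
january = [datetime(2020, 1, d, tzinfo=timezone.utc) for d in range(1, 8)]
february = [datetime(2020, 2, d, tzinfo=timezone.utc) for d in range(1, 8)]

q3, q1 = quartile_endpoints(january, february)
print("drift lies roughly between", q3.date(), "and", q1.date())
```

The two values bracket the interval in which the change is likely to have occurred. A single representative date such as change_date_repr would fall somewhere inside it.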

To understand the differences between the logs, a decision tree can be built (for example, between the log of the first period and the log of the second period):

from pm4py.algo.other.conceptdrift.utils import get_representation
clf, feature_names, classes = get_representation.get_decision_tree(logs_list[0], logs_list[1])
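The idea behind this step can be sketched directly with scikit-learn: label each case with the period it belongs to and train a tree to tell the periods apart. The features that the tree splits on are the ones that changed. The data and the feature names below are hypothetical, and the sketch does not claim to reproduce the internals of get_decision_tree.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical per-case features: [amount, number_of_events].
first_period = [[100, 5], [120, 6], [90, 5], [110, 6]]
second_period = [[900, 5], [950, 6], [1000, 5], [920, 6]]

X = first_period + second_period
# Class 0 = case from the first sublog, class 1 = case from the second sublog.
y = [0] * len(first_period) + [1] * len(second_period)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
names = ["amount", "number_of_events"]

# Print the learned rules as text; the root split shows what changed.
print(export_text(tree, feature_names=names))
```

In this toy data only "amount" differs between the periods, so the tree separates them with a single split on that feature.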

The decision tree can then be visualized:

from pm4py.visualization.decisiontree import factory as dec_tree_visualization
gviz = dec_tree_visualization.apply(clf, feature_names, classes, parameters={"format": "svg"})