Correlation Miner

Correlation Miner

In Process Mining, we are used to have logs containing at least:

  • A case identifier
  • An activity
  • A timestamp

The case identifier associates an event, happening to a system, to a particular execution of the process. This permits to apply algorithms such as process discovery, conformance checking, …

However, in some systems (for example, the data collected from IoT systems), it may be difficult to associate a case identifier. On top of these logs, performing classic process mining is impossible. Correlation mining borns as a response to the challenge to extract a process model from such event logs, that permits to read useful information that is contained in the logs without a case identifier, that contains only:

  • An activity column
  • A timestamp column

In this description, we assume there is a total order on events (that means that no events happen in the same timestamp). Situations where a total order is not defined are more complicated.

The Correlation Miner is an approach proposed in

Pourmirza, Shaya, Remco Dijkman, and Paul Grefen. “Correlation miner: mining business process models and event correlations without case identifiers.” International Journal of Cooperative Information Systems 26.02 (2017): 1742002.

That aims to resolve this problem by resolving an (integer) linear problem defined on top of:

  • The P/S matrix: expressing the relationship of order between the activities as recorded in the log.
  • The Duration matrix: expressing an aggregation of the duration between two activities, obtained by solving an optimization problem

The solution of this problem provides a set of couples of activities that are, according to the approach, in directly-follows relationship, along with the strength of the relationship. This is the “frequency” DFG.

A “performance” DFG can be obtained by the duration matrix, keeping only the entries that appear in the solution of the problem (i.e., the couples of activities that appear in the “frequency” DFG).

This can be then visualized (using for example the PM4Py DFG visualization).

To have a “realistic” example (for which we know the “real” DFG), we can take an existing log and simply remove the case ID column .. Trying then to reconstruct the DFG without having that 🙂

Let’s try an example of that. First, we load a CSV file into a Pandas dataframe, keeping only the concept:name and the time:timestamp columns:

from pm4py.objects.log.adapters.pandas import csv_import_adapter

df = csv_import_adapter.import_dataframe_from_path("tests/input_data/receipt.csv")
df = df[["concept:name", "time:timestamp"]]

Then, we can apply the Correlation Miner approach:

from pm4py.algo.discovery.correlation_mining import algorithm as correlation_miner

frequency_dfg, performance_dfg = correlation_miner.apply(df, parameters={correlation_miner.Variants.CLASSIC.value.Parameters.ACTIVITY_KEY: "concept:name",
                                correlation_miner.Variants.CLASSIC.value.Parameters.TIMESTAMP_KEY: "time:timestamp"})

To better visualize the DFG, we can retrieve the frequency of activities

activities_freq = dict(df["concept:name"].value_counts())

And then perform the visualization of the DFG:

from pm4py.visualization.dfg import visualizer as dfg_visualizer
gviz_freq = dfg_visualizer.apply(frequency_dfg, variant=dfg_visualizer.Variants.FREQUENCY, activities_count=activities_freq, parameters={"format": "svg"})
gviz_perf = dfg_visualizer.apply(performance_dfg, variant=dfg_visualizer.Variants.PERFORMANCE, activities_count=activities_freq, parameters={"format": "svg"})

This is the view of the frequency DFG

And this is the view of the performance DFG

So, correlation mining was able to discover a visualization where the main path is clear.

Different variants of the correlation miner are available:

  • Variants.CLASSIC: calculates the P/S matrix and the duration matrix in the classic way (the entire list of events is used)
  • Variants.TRACE_BASED: calculates the P/S matrix and the duration matrix on a classic event log, trace-by-trace, and merges the results. The resolution of the linear problem permits to obtain a model that is more understandable than the classic DFG calculated on top of the log.
  • Variants.CLASSIC_SPLIT: calculates the P/S matrix and the duration matrix on the entire list of events, as in the classic version, but splits that in chunks to fasten the computation. Hence, the generated model is less accurate (in comparison to the CLASSIC version) but the calculation is faster. The default chunk size is 100000 events.