Feature Selection on Event Logs

A feature selection operation allows the event log to be represented in a tabular way. This is important for operations such as prediction and anomaly detection.

Automatic Feature Selection

In PM4Py, we offer ways to perform an automatic feature selection. As an example, let us import the receipt log and perform an automatic feature selection on top of it.

First, we import the receipt log:

from pm4py.objects.log.importer.xes import importer as xes_importer

log = xes_importer.apply("tests/input_data/receipt.xes")

Then, let’s perform the automatic feature selection:

from pm4py.objects.log.util import get_log_representation

data, feature_names = get_log_representation.get_default_representation(log)
print(feature_names)

Looking at feature_names, we see that the following attributes were selected:

  • The attribute channel at the trace level (which assumes the values Desk, Intern, Internet, Post, e-mail);
  • The attribute department at the trace level (which assumes the values Customer contact, Experts, General);
  • The attribute org:group at the event level (which assumes the values EMPTY, Group 1, Group 12, Group 13, Group 14, Group 15, Group 2, Group 3, Group 4, Group 7).

No numeric attribute is selected.

The raw printed representation of feature_names is the following:

['trace:channel@Desk', 'trace:channel@Intern', 'trace:channel@Internet', 'trace:channel@Post', 'trace:channel@e-mail', 'trace:department@Customer contact', 'trace:department@Experts', 'trace:department@General', 'event:org:group@EMPTY', 'event:org:group@Group 1', 'event:org:group@Group 12', 'event:org:group@Group 13', 'event:org:group@Group 14', 'event:org:group@Group 15', 'event:org:group@Group 2', 'event:org:group@Group 3', 'event:org:group@Group 4', 'event:org:group@Group 7']

So, we see that we get a different feature for each value of each attribute. This is called one-hot encoding: a case is assigned the value 0 for a feature if it does not contain any event with the given value for the attribute, and the value 1 if it contains at least one such event.
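The same 0/1 logic can be reproduced with a minimal sketch, independent of PM4Py, on hypothetical data (the traces and attribute values below are made up for illustration):

# One-hot encoding sketch: one feature per observed attribute value;
# a trace gets 1 if any of its events carries that value, 0 otherwise.
traces = [["Group 1", "Group 4"], ["Group 4"], ["Group 7"]]
values = sorted({v for trace in traces for v in trace})
matrix = [[1 if v in trace else 0 for v in values] for trace in traces]
print(values)   # ['Group 1', 'Group 4', 'Group 7']
print(matrix)   # [[1, 1, 0], [0, 1, 0], [0, 0, 1]]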

If we represent the features as a dataframe:

import pandas as pd
df = pd.DataFrame(data, columns=feature_names)
print(df)

We can see the features assigned to each different case:

      trace:channel@Desk  trace:channel@Intern  ...  event:org:group@Group 4  event:org:group@Group 7
0                      0                     0  ...                        1                        0
1                      0                     0  ...                        1                        0
2                      0                     0  ...                        1                        0
3                      0                     0  ...                        1                        0
4                      0                     0  ...                        1                        0
...                  ...                   ...  ...                      ...                      ...
1429                   0                     0  ...                        1                        0
1430                   0                     0  ...                        1                        0
1431                   0                     0  ...                        1                        0
1432                   0                     0  ...                        0                        0
1433                   0                     0  ...                        1                        0

[1434 rows x 18 columns]
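Standard Pandas operations can then be used to inspect a single case; for example, to see which features are active (value 1) for the first case:

# Show only the features with value 1 for the first case
first = df.iloc[0]
print(first[first > 0])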

Manual Feature Selection

Manual feature selection permits the user to specify which attributes should be included as features. These may include, for example:

  • The activities performed in the process execution (contained in the event attribute concept:name).
  • The resources that perform the process execution (contained in the event attribute org:resource).
  • Some numeric attributes, at the discretion of the user.

To do so, we have to call the get_representation method of the get_log_representation module.

The types of features that can be considered by a manual feature selection are:

  • (str_ev_attr) String attributes at the event level: these are one-hot encoded into features that may assume the value 0 or the value 1.
  • (str_tr_attr) String attributes at the trace level: these are one-hot encoded into features that may assume the value 0 or the value 1.
  • (num_ev_attr) Numeric attributes at the event level: these are encoded by including the last value of the attribute among the events of the trace.
  • (num_tr_attr) Numeric attributes at the trace level: these are encoded by including the numeric value.
  • (str_evsucc_attr) Successions related to string attribute values at the event level: for example, if we have a trace [A,B,C], it might be important to include not only the presence of the single values A, B and C as features, but also the presence of the directly-follows couples (A,B) and (B,C); a small sketch of this extraction follows the list.
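The directly-follows couples mentioned in the last point can be sketched with plain Python (illustrative only; PM4Py computes them internally):

# Extract the directly-follows couples from a trace of activities
trace = ["A", "B", "C"]
couples = list(zip(trace, trace[1:]))
print(couples)  # [('A', 'B'), ('B', 'C')]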

Let’s consider, for example, a feature selection where we are interested in:

  • Whether a process execution contains, or not, a given activity.
  • Whether a process execution contains, or not, a given resource.
  • Whether a process execution contains, or not, a directly-follows path between two given activities.
  • Whether a process execution contains, or not, a directly-follows path between two given resources.

from pm4py.objects.log.util import get_log_representation

data, feature_names = get_log_representation.get_representation(log, str_ev_attr=["concept:name", "org:resource"],
                                                                str_tr_attr=[], num_ev_attr=[], num_tr_attr=[],
                                                                str_evsucc_attr=["concept:name", "org:resource"])
print(len(feature_names))

We see that the number of features is considerably larger in this setting.
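To get a feeling for what was generated, we can also print a sample of the feature names (their exact format may vary across PM4Py versions):

# Inspect a sample of the generated feature names
print(feature_names[:10])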

Calculating useful features

Other useful features are, for example, the ones described at http://pm4py.pads.rwth-aachen.de/incremental-calculation-of-cycle-and-lead-time/

Here, we assume to have either:

  • A lifecycle log, where each event is instantaneous and carries a lifecycle attribute,
  • OR an interval log, where each event may be associated with two timestamps (start and end timestamp).

The lead/cycle time can be calculated on top of interval logs. If we have a lifecycle log, we first need to convert it with:

from pm4py.objects.log.util import interval_lifecycle
log = interval_lifecycle.to_interval(log)
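As a quick sanity check (the exact attribute names depend on the conversion parameters), we can inspect the first event of the converted log, which should now carry both a start and a complete timestamp:

# Print the attributes of the first event of the first trace
print(dict(log[0][0]))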

Then, features such as the lead/cycle time can be inserted through the following instructions:

from pm4py.objects.log.util import interval_lifecycle
from pm4py.util import constants

log = interval_lifecycle.assign_lead_cycle_time(log, parameters={
    constants.PARAMETER_CONSTANT_START_TIMESTAMP_KEY: "start_timestamp",
    constants.PARAMETER_CONSTANT_TIMESTAMP_KEY: "time:timestamp"})

After providing the start timestamp attribute (in this case, start_timestamp) and the timestamp attribute (in this case, time:timestamp), the following features are written by the method on the events (they can be inspected as sketched after this list):

  • @@approx_bh_partial_cycle_time => incremental cycle time associated with the event (the cycle time of the last event is the cycle time of the instance)
  • @@approx_bh_partial_lead_time => incremental lead time associated with the event
  • @@approx_bh_overall_wasted_time => difference between the partial lead time and the partial cycle time values
  • @@approx_bh_this_wasted_time => wasted time ONLY with regard to the activity described by the 'interval' event
  • @@approx_bh_ratio_cycle_lead_time => measures the incremental Flow Rate (between 0 and 1).
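To verify that these attributes have indeed been written on the events, we can print a couple of them for the events of a single trace (this assumes the enrichment above has been applied):

# Print the partial cycle/lead time of each event of the first trace
for event in log[0]:
    print(event["concept:name"],
          event["@@approx_bh_partial_cycle_time"],
          event["@@approx_bh_partial_lead_time"])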

These are all numerical attributes; hence, we can refine the feature extraction as follows:

from pm4py.objects.log.util import get_log_representation

data, feature_names = get_log_representation.get_representation(log, str_ev_attr=["concept:name", "org:resource"],
                                                                str_tr_attr=[],
                                                                num_ev_attr=["@@approx_bh_partial_cycle_time",
                                                                             "@@approx_bh_partial_lead_time",
                                                                             "@@approx_bh_overall_wasted_time",
                                                                             "@@approx_bh_this_wasted_time",
                                                                             "@approx_bh_ratio_cycle_lead_time"],
                                                                num_tr_attr=[],
                                                                str_evsucc_attr=["concept:name", "org:resource"])

PCA – Reducing the number of features

Some techniques (such as clustering, prediction, and anomaly detection) suffer if the dimensionality of the dataset is too high. Hence, a dimensionality reduction technique (such as PCA) helps to cope with the complexity of the data.

Having a Pandas dataframe out of the features extracted from the log:

import pandas as pd

df = pd.DataFrame(data, columns=feature_names)

It is possible to reduce the number of features using a technique such as PCA.

Let’s create a PCA with a number of components equal to 5, and apply it to the dataframe:

from sklearn.decomposition import PCA

pca = PCA(n_components=5)
df2 = pd.DataFrame(pca.fit_transform(df))

The dataframe becomes:

                 0              1         2         3         4
0     1.429433e+06 -186480.814596  1.816337  0.036157  0.608428
1    -1.988317e+05   -8954.037209  1.748554 -0.517529 -0.624937
2    -1.992730e+05   -9087.073388 -0.618927  0.457348 -0.059923
3    -1.994546e+05   -9188.006038 -0.433247  0.729482 -0.099950
4    -1.988642e+05   -9252.275829 -0.663476  0.462576 -0.162645
...            ...            ...       ...       ...       ...
1429 -1.994546e+05   -9188.006038 -0.433247  0.729482 -0.099950
1430 -1.993703e+05   -9197.187437 -0.618920  0.457280 -0.059915
1431 -1.990913e+05   -8986.140737 -0.618922  0.457428 -0.059927
1432 -1.993703e+05   -9197.187436 -0.585567  0.382993 -0.061584
1433 -1.993703e+05   -9197.187436 -0.599801  0.384898 -0.044923

[1434 rows x 5 columns]

So, from more than 400 columns, we go down to 5 columns that retain most of the variance.
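To check how much of the original variance the 5 components actually retain, scikit-learn exposes the explained variance ratio:

# Fraction of variance explained by each component, and in total
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())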

Anomaly Detection

In this section, we consider the calculation of an anomaly score for the different cases. This is based on the features extracted previously; to work well, it requires the prior application of a dimensionality reduction technique (such as the PCA of the previous section).

Let’s apply the IsolationForest method to the dataframe. This permits adding a scores column, which is lower than or equal to 0 when the case is to be considered anomalous, and greater than 0 when it is not.

from sklearn.ensemble import IsolationForest
model = IsolationForest()
model.fit(df2)
df2["scores"] = model.decision_function(df2)

To see which cases are most anomalous, we can insert the case index as a column and sort the dataframe by score. The print will then show the most anomalous cases first:

df2["@@index"] = df2.index
df2 = df2[["scores", "@@index"]]
df2 = df2.sort_values("scores")
print(df2)

We see that the following cases are considered the most anomalous:

          scores  @@index
706    -0.288078      706
745    -0.243753      745
362    -0.238414      362
73     -0.216372       73
1076   -0.215097     1076
...          ...      ...
1376    0.164391     1376
122     0.164391      122
149     0.164391      149
1047    0.164530     1047
176     0.164530      176

[1434 rows x 2 columns]

Instead, the cases at the bottom of the ranking are considered the least anomalous.
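Since @@index holds the position of the case inside the log, we can go back from the ranking to the actual cases; for example, to print the case identifiers of the five most anomalous cases (assuming the standard concept:name trace attribute):

# Retrieve the case IDs of the five most anomalous cases
for idx in df2.head(5)["@@index"]:
    print(log[int(idx)].attributes.get("concept:name"))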

If we execute:

print([(x["concept:name"], x["org:resource"]) for x in log[706]])
print([(x["concept:name"], x["org:resource"]) for x in log[745]])

We get:

[('Confirmation of receipt', 'Resource21'), ('T02 Check confirmation of receipt', 'Resource21'), ('T04 Determine confirmation of receipt', 'Resource21'), ('T05 Print and send confirmation of receipt', 'admin1'), ('T06 Determine necessity of stop advice', 'admin2')]
[('Confirmation of receipt', 'Resource21'), ('T06 Determine necessity of stop advice', 'admin2'), ('T02 Check confirmation of receipt', 'admin2')]

Instead, if we execute:

print([(x["concept:name"], x["org:resource"]) for x in log[1047]])
print([(x["concept:name"], x["org:resource"]) for x in log[176]])

We get:

[('Confirmation of receipt', 'Resource16'), ('T02 Check confirmation of receipt', 'Resource16'), ('T04 Determine confirmation of receipt', 'Resource16'), ('T05 Print and send confirmation of receipt', 'Resource16'), ('T06 Determine necessity of stop advice', 'Resource16'), ('T10 Determine necessity to stop indication', 'Resource16')]
[('Confirmation of receipt', 'Resource16'), ('T02 Check confirmation of receipt', 'Resource16'), ('T04 Determine confirmation of receipt', 'Resource16'), ('T05 Print and send confirmation of receipt', 'Resource16'), ('T06 Determine necessity of stop advice', 'Resource16'), ('T10 Determine necessity to stop indication', 'Resource16')]

So, the first two executions are judged to be the strangest, while the last two are among the most ordinary, judging only by the pair of attributes (activity, resource).
