Filtering Pandas dataframes

Filtering is one of the most important techniques in Process Mining, as it permits retrieving a smaller part of the dataframe that contains only the information we need.

In this page, a collection of techniques to filter a Pandas dataframe object is reported. Two families of techniques are available to filter the dataframe:

  • Keeping/removing entire traces (so, all the events belonging to those traces) according to some criterion
  • Trimming traces, keeping/removing single events according to some criterion

To import a Pandas dataframe object in PM4Py, the following code could be used:

import os
from pm4py.objects.log.adapters.pandas import csv_import_adapter

dataframe = csv_import_adapter.import_dataframe_from_path(os.path.join("tests","input_data","receipt.csv"))
print(len(dataframe))
print(len(dataframe.groupby("case:concept:name")))

Filtering on timeframe

A timeframe specifies a time interval that should be respected by traces or events. PM4Py provides the following filters on the timeframe.
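
To illustrate the difference between the two trace-level semantics ("contained in" versus "intersecting" the timeframe), the following plain-pandas sketch (a hypothetical illustration, not the PM4Py implementation) computes both sets of cases on a toy dataframe with the standard case and timestamp columns:

```python
import pandas as pd

# Toy dataframe with the standard PM4Py column names.
df = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c2", "c2"],
    "time:timestamp": pd.to_datetime([
        "2011-03-10", "2011-06-01",   # c1 lies fully inside the interval
        "2011-01-01", "2011-06-01",   # c2 starts before it, so it only intersects
    ]),
})
start, end = pd.Timestamp("2011-03-09"), pd.Timestamp("2012-01-18 23:59:59")

# First and last timestamp of each case.
spans = df.groupby("case:concept:name")["time:timestamp"].agg(["min", "max"])
# "Contained": the whole case lies inside the interval.
contained = spans[(spans["min"] >= start) & (spans["max"] <= end)].index
# "Intersecting": the case overlaps the interval in any way.
intersecting = spans[(spans["min"] <= end) & (spans["max"] >= start)].index

print(list(contained))      # ['c1']
print(list(intersecting))   # ['c1', 'c2']
```

Every contained case is also intersecting, which is why the intersecting filter below keeps more traces than the contained one.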

Filtering traces contained in the timeframe

The following code could be used on the receipt.csv log to keep only the traces fully contained in the interval between 09 March 2011 and 18 January 2012:

from pm4py.algo.filtering.pandas.timestamp import timestamp_filter

df_timest_contained = timestamp_filter.filter_traces_contained(dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59")
print(len(df_timest_contained))
print(len(df_timest_contained.groupby("case:concept:name")))

We get that while the original log contained 1434 traces, the filtered log contains only 959 traces.

Filtering traces intersecting the timeframe

The following code could be used on the receipt.csv log to keep the traces intersecting the interval between 09 March 2011 and 18 January 2012:

from pm4py.algo.filtering.pandas.timestamp import timestamp_filter

df_timest_intersecting = timestamp_filter.filter_traces_intersecting(dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59")
print(len(df_timest_intersecting))
print(len(df_timest_intersecting.groupby("case:concept:name")))

We get that while the original log contained 1434 traces, the filtered log contains only 975 traces.

Filtering events contained in the timeframe

In this case, we want to keep all the events contained in the timeframe, without preserving trace integrity. We can use the following code:

from pm4py.algo.filtering.pandas.timestamp import timestamp_filter

df_timest_events = timestamp_filter.apply_events(dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59")
print(len(df_timest_events))
print(len(df_timest_events.groupby("case:concept:name")))

Filtering on case performance

This filter permits keeping only the traces in the Pandas dataframe whose duration lies inside a specified interval.

The following code applies a filter on case performance keeping only cases with duration between 1 day and 10 days:

from pm4py.algo.filtering.pandas.cases import case_filter

# durations are expressed in seconds: 86400 s = 1 day, 864000 s = 10 days
df_cases = case_filter.filter_on_case_performance(dataframe, min_case_performance=86400, max_case_performance=864000)
print(len(df_cases))
print(len(df_cases.groupby("case:concept:name")))

On the receipt.csv log, the number of traces satisfying this criterion is 296.

Filtering on start activities

This filter permits keeping only the traces in the Pandas dataframe whose start activity belongs to a set of specified activities.

To retrieve the list of start activities in the dataframe, the following code could be used:

from pm4py.algo.filtering.pandas.start_activities import start_activities_filter

log_start = start_activities_filter.get_start_activities(dataframe)
print(log_start)

In this case, we get the following dictionary reporting the start activities and the number of occurrences:

{'Confirmation of receipt': 1434}

To apply a filter (even if useless in this case, since all traces start with the same activity), the following code could be used:

df_start_activities = start_activities_filter.apply(dataframe, ["Confirmation of receipt"])
print(len(df_start_activities))
print(len(df_start_activities.groupby("case:concept:name")))

It is possible to automatically keep the most frequent start activities, using the apply_auto_filter function. The function accepts as parameter a decreasingFactor (by default equal to 0.6), whose working can be explained through an example. Suppose we have a log with the following start activities, ordered by number of occurrences:

  1. A with number of occurrences 1000
  2. B with number of occurrences 700
  3. C with number of occurrences 300
  4. D with number of occurrences 50

The most frequent start activity (A, with 1000 occurrences) is surely kept by the method. Then, the number of occurrences of B is compared with the number of occurrences of A: occ(B)/occ(A) = 0.7. If decreasingFactor=0.6, then B is also kept as admissible start activity (because occ(B)/occ(A) > 0.6); if decreasingFactor=0.8, then B is not kept as admissible start activity and the method stops here. If B is accepted, C is also considered: occ(C)/occ(B) = 0.43. If decreasingFactor=0.6, then C is not accepted as admissible start activity and the method stops here.

Depending on the value of the decreasing factor, we have the following set of admitted start activities:

  • decreasingFactor = 0.8 => {A}
  • decreasingFactor = 0.6 => {A,B}
  • decreasingFactor = 0.4 => {A,B,C}
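
The selection rule described above can be sketched in plain Python (a hypothetical illustration of the rule, not the PM4Py source):

```python
# Sketch of the decreasingFactor selection rule: walk the activities in
# decreasing order of frequency and stop at the first sharp drop.
def admitted_activities(counts, decreasing_factor=0.6):
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    admitted = [ranked[0][0]]          # the most frequent activity is always kept
    for (_, prev_occ), (act, occ) in zip(ranked, ranked[1:]):
        if occ / prev_occ > decreasing_factor:
            admitted.append(act)       # frequency did not drop too sharply
        else:
            break                      # stop at the first sharp drop
    return admitted

counts = {"A": 1000, "B": 700, "C": 300, "D": 50}
print(admitted_activities(counts, 0.8))  # ['A']
print(admitted_activities(counts, 0.6))  # ['A', 'B']
print(admitted_activities(counts, 0.4))  # ['A', 'B', 'C']
```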

To apply a filter (and print the admitted start activities) the following code could be used:

df_auto_sa = start_activities_filter.apply_auto_filter(dataframe, parameters={"decreasingFactor": 0.6})
print(start_activities_filter.get_start_activities(df_auto_sa))
print(len(df_auto_sa))
print(len(df_auto_sa.groupby("case:concept:name")))

We get the following admitted start activities:

{'Confirmation of receipt': 1434}

Filtering on end activities

This filter permits keeping only the traces in the Pandas dataframe whose end activity belongs to a set of specified activities.

To retrieve the list of end activities in the log, the following code could be used:

from pm4py.algo.filtering.pandas.end_activities import end_activities_filter

log_end = end_activities_filter.get_end_activities(dataframe)
print(log_end)

In this case, we get the following dictionary reporting the end activities and the number of occurrences:

{'T10 Determine necessity to stop indication': 828, 'T05 Print and send confirmation of receipt': 400, 'Confirmation of receipt': 116, 'T15 Print document X request unlicensed': 39, 'T06 Determine necessity of stop advice': 16, 'T20 Print report Y to stop indication': 15, 'T02 Check confirmation of receipt': 8, 'T11 Create document X request unlicensed': 4, 'T03 Adjust confirmation of receipt': 2, 'T04 Determine confirmation of receipt': 2, 'T07-1 Draft intern advice aspect 1': 1, 'T13 Adjust document X request unlicensed': 1, 'T07-2 Draft intern advice aspect 2': 1, 'T07-5 Draft intern advice aspect 5': 1})

To apply a filter the following code could be used:

df_end_activities = end_activities_filter.apply(dataframe, ["T05 Print and send confirmation of receipt", "T10 Determine necessity to stop indication"])
print(len(df_end_activities))
print(len(df_end_activities.groupby("case:concept:name")))

In this case, printing the number of cases of the filtered dataframe shows that 1228 of the original 1434 traces are kept.

It is possible to automatically filter the dataframe to keep the most frequent end activities in the log through the apply_auto_filter function that accepts as parameter a decreasingFactor (default 0.6; see the start activities filter for an explanation).

The following code filters the log keeping only the most frequent end activities:

df_auto_ea = end_activities_filter.apply_auto_filter(dataframe, parameters={"decreasingFactor": 0.6})
print(end_activities_filter.get_end_activities(df_auto_ea))
print(len(df_auto_ea))
print(len(df_auto_ea.groupby("case:concept:name")))

We get the following admitted end activities:

{'T10 Determine necessity to stop indication': 828}

Filtering on variants

A variant is a set of cases that share the same control-flow perspective, i.e., a set of cases that share the same classified events (activities) in the same order.
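
Conceptually, a variant can be seen as the comma-separated sequence of activities of a case. The following plain-pandas sketch (a hypothetical illustration, not the PM4Py implementation) computes variants on a toy dataframe with the standard case/activity/timestamp columns:

```python
import pandas as pd

# Toy dataframe with the standard PM4Py column names.
df = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c2", "c2", "c3"],
    "concept:name": ["register", "pay", "register", "pay", "register"],
    "time:timestamp": pd.to_datetime([
        "2020-01-01", "2020-01-02", "2020-02-01", "2020-02-02", "2020-03-01",
    ]),
})

# Sort each case's events by timestamp and join the activities into a string:
# cases with the same string belong to the same variant.
variants = (df.sort_values(["case:concept:name", "time:timestamp"])
              .groupby("case:concept:name")["concept:name"]
              .agg(",".join))
print(variants.value_counts().to_dict())  # {'register,pay': 2, 'register': 1}
```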

To get the list of variants contained in a given dataframe, the following code could be used:

from pm4py.algo.filtering.pandas.variants import variants_filter

variants = variants_filter.get_variants_df(df)

This is expressed as a dictionary having as key the variant and as value the list of cases sharing that variant. If the number of occurrences of the variants is of interest, the following code retrieves a list of variants along with their count:

from pm4py.statistics.traces.pandas import case_statistics

variants_count = case_statistics.get_variant_statistics(df)
variants_count = sorted(variants_count, key=lambda x: x['count'], reverse=True)
print(variants_count)

Obtaining the following output:

[{'variant': 'Confirmation of receipt,T02 Check confirmation of receipt,T04 Determine confirmation of receipt,T05 Print and send confirmation of receipt,T06 Determine necessity of stop advice,T10 Determine necessity to stop indication', 'count': 713}, {'variant': 'Confirmation of receipt,T06 Determine necessity of stop advice,T10 Determine necessity to stop indication,T02 Check confirmation of receipt,T04 Determine confirmation of receipt,T05 Print and send confirmation of receipt', 'count': 123}, {'variant': 'Confirmation of receipt', 'count': 116}, {'variant': 'Confirmation of receipt,T02 Check confirmation of receipt,T06 Determine necessity of stop advice,T10 Determine necessity to stop indication,T04 Determine confirmation of receipt,T05 Print and send confirmation of receipt', 'count': 115}, ...

The dataframe could be filtered:

filtered_df1 = variants_filter.apply(df, ["Confirmation of receipt,T02 Check confirmation of receipt,T04 Determine confirmation of receipt,T05 Print and send confirmation of receipt,T06 Determine necessity of stop advice,T10 Determine necessity to stop indication"])

And the variants of the dataframe could be checked:

variants_count_filtered_df1 = case_statistics.get_variant_statistics(filtered_df1)
print(variants_count_filtered_df1)

Obtaining:

[{'variant': 'Confirmation of receipt,T02 Check confirmation of receipt,T04 Determine confirmation of receipt,T05 Print and send confirmation of receipt,T06 Determine necessity of stop advice,T10 Determine necessity to stop indication', 'count': 713}]

Suppose instead you want to filter out (remove) the most common variant. The following code could be used:

filtered_df2 = variants_filter.apply(df, ["Confirmation of receipt,T02 Check confirmation of receipt,T04 Determine confirmation of receipt,T05 Print and send confirmation of receipt,T06 Determine necessity of stop advice,T10 Determine necessity to stop indication"], parameters={"positive": False})

And the variants checked as above, obtaining:

[{'variant': 'Confirmation of receipt,T06 Determine necessity of stop advice,T10 Determine necessity to stop indication,T02 Check confirmation of receipt,T04 Determine confirmation of receipt,T05 Print and send confirmation of receipt', 'count': 123}, {'variant': 'Confirmation of receipt', 'count': 116}, ...

A filter that automatically keeps the most common variants could be applied through the apply_auto_filter method. This method accepts a parameter called decreasingFactor (default value is 0.6; further details are provided in the start activities filter).

auto_filtered_log = variants_filter.apply_auto_filter(df)

Filtering on attribute values

Filtering on attribute values permits alternatively to:

  • Keep cases that contain at least one event with one of the given attribute values
  • Remove cases that contain an event with one of the given attribute values
  • Keep events (trimming traces) that have one of the given attribute values
  • Remove events (trimming traces) that have one of the given attribute values

Examples of attributes are the resource (generally contained in the org:resource attribute) and the activity (generally contained in the concept:name attribute).
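
The difference between the case-level and the event-level semantics can be illustrated with plain pandas (a hypothetical sketch, not the PM4Py implementation), here using the org:resource attribute:

```python
import pandas as pd

# Toy dataframe with the standard PM4Py column names.
df = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c2"],
    "concept:name": ["register", "pay", "register"],
    "org:resource": ["Resource10", "Resource99", "Resource99"],
})
values = ["Resource10"]

# Event level: trim traces, keeping only the matching events.
events_kept = df[df["org:resource"].isin(values)]

# Case level: keep every event of each case that contains a matching event.
matching_cases = df.loc[df["org:resource"].isin(values), "case:concept:name"].unique()
cases_kept = df[df["case:concept:name"].isin(matching_cases)]

print(len(events_kept))  # 1 (only the matching event of case c1)
print(len(cases_kept))   # 2 (both events of case c1)
```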

To get the list of resources and activities contained in the dataframe, the following code could be used:

from pm4py.algo.filtering.pandas.attributes import attributes_filter

activities = attributes_filter.get_attribute_values(df, attribute_key="concept:name")
resources = attributes_filter.get_attribute_values(df, attribute_key="org:resource")

Retrieving this list of resources (on the Receipt log):

{'Resource01': 1228, 'Resource02': 580, 'Resource03': 552, 'Resource04': 483, 'Resource05': 445, 'Resource06': 430, 'Resource07': 424, 'Resource08': 356, 'admin1': 352, 'Resource09': 350, 'Resource10': 329, 'Resource11': 328, 'Resource12': 326, 'Resource13': 307, 'Resource14': 264, 'Resource15': 235, 'Resource16': 215, 'Resource17': 194, 'Resource18': 170, 'admin2': 160, 'Resource19': 136, 'Resource20': 120, 'Resource21': 104, 'Resource22': 80, 'Resource23': 78, 'Resource24': 50, 'Resource25': 49, 'Resource26': 44, 'Resource27': 43, 'Resource28': 30, 'Resource29': 20, 'Resource30': 13, 'Resource31': 12, 'Resource32': 11, 'Resource33': 11, 'Resource34': 10, 'Resource35': 8, 'Resource36': 6, 'Resource37': 5, 'test': 5, 'admin3': 3, 'Resource38': 3, 'TEST': 2, 'Resource39': 2, 'Resource40': 1, 'Resource41': 1, 'Resource43': 1, 'Resource42': 1}

And this list of activities:

{'Confirmation of receipt': 1434, 'T06 Determine necessity of stop advice': 1416, 'T02 Check confirmation of receipt': 1368, 'T04 Determine confirmation of receipt': 1307, 'T05 Print and send confirmation of receipt': 1300, 'T10 Determine necessity to stop indication': 1283, 'T03 Adjust confirmation of receipt': 55, 'T07-1 Draft intern advice aspect 1': 45, 'T11 Create document X request unlicensed': 44, 'T12 Check document X request unlicensed': 41, 'T15 Print document X request unlicensed': 39, 'T14 Determine document X request unlicensed': 39, 'T07-2 Draft intern advice aspect 2': 32, 'T07-5 Draft intern advice aspect 5': 27, 'T17 Check report Y to stop indication': 26, 'T20 Print report Y to stop indication': 20, 'T19 Determine report Y to stop indication': 20, 'T16 Report reasons to hold request': 20, 'T08 Draft and send request for advice': 18, 'T07-3 Draft intern advice hold for aspect 3': 8, 'T09-3 Process or receive external advice from party 3': 8, 'T09-1 Process or receive external advice from party 1': 7, 'T07-4 Draft internal advice to hold for type 4': 6, 'T18 Adjust report Y to stop indicition': 6, 'T09-4 Process or receive external advice from party 4': 5, 'T13 Adjust document X request unlicensed': 2, 'T09-2 Process or receive external advice from party 2': 1}

To filter traces containing/not containing a given list of resources, the following code could be used:

from pm4py.util import constants
df_traces_pos = attributes_filter.apply(df, ["Resource10"],
                                          parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "org:resource", "positive": True})
df_traces_neg = attributes_filter.apply(df, ["Resource10"],
                                          parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "org:resource", "positive": False})

To filter events (trimming traces) the following code could be used:

df_events = attributes_filter.apply_events(df, ["Resource10"],
                                          parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "org:resource", "positive": True})

To automatically apply a filter on event attributes (trimming traces and keeping only the events whose attribute has a frequent value), the apply_auto_filter method is provided. The method accepts as parameters the attribute name and the decreasingFactor (default 0.6; an explanation could be found in the start activities filter). Example:

from pm4py.algo.filtering.pandas.attributes import attributes_filter
from pm4py.util import constants
filtered_df = attributes_filter.apply_auto_filter(df, parameters={
    constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "concept:name", "decreasingFactor": 0.6})

Filtering on numeric attribute values

Filtering on numeric attribute values provides options similar to filtering on string attribute values (considered above). Let us see an example, importing the roadtraffic100traces.csv log file:

import os
from pm4py.objects.log.adapters.pandas import csv_import_adapter

df = csv_import_adapter.import_dataframe_from_path(os.path.join("tests", "input_data", "roadtraffic100traces.csv"))

The following filter on events keeps only the events whose amount is between 34 and 36:

from pm4py.algo.filtering.pandas.attributes import attributes_filter
from pm4py.util import constants
filtered_df_events = attributes_filter.apply_numeric_events(df, 34, 36,
                                             parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "amount"})
print(len(filtered_df_events))

Similarly, the following filter on cases keeps only the cases with at least one event whose amount lies in the specified range:

filtered_df_cases = attributes_filter.apply_numeric(df, 34, 36,
                                             parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "amount"})
print(len(filtered_df_cases))

The filter on cases provides the option to specify up to two additional attributes that are checked on the events that shall satisfy the numeric range. For example, if we are interested in cases having an event with activity Add penalty and an amount between 34 and 500, the following code helps:

filtered_df_cases = attributes_filter.apply_numeric(df, 34, 500,
                                             parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "amount",
                                                         "stream_filter_key1": "concept:name",
                                                         "stream_filter_value1": "Add penalty"})
print(len(filtered_df_cases))
