Big Dataframe management

PM4Py integrates Pandas to manage CSV log files efficiently. This page provides some solutions to improve the scalability of Process Mining operations.

Loading

Pandas needs two main pieces of information about the CSV file:

  • The separator (default is the comma ,)
  • The quote character (default is None)
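For reference, these two settings map directly onto the sep and quotechar parameters of the underlying pandas read_csv function. A minimal, self-contained sketch using an in-memory CSV (rather than a real log file) to show their effect:

```python
import io
import pandas as pd

# A tiny in-memory CSV that uses ';' as separator and '"' as quote character
csv_data = io.StringIO(
    'case:concept:name;concept:name;time:timestamp\n'
    '1;"register request";2011-01-01 10:00:00\n'
    '1;"examine file";2011-01-01 11:00:00\n'
)

# sep and quotechar mirror the two settings listed above
df = pd.read_csv(csv_data, sep=';', quotechar='"')

print(df.columns.tolist())
```

The quote character lets field values contain the separator (or spaces) without breaking the parsing.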

The fastest option is to load the CSV without preliminarily converting the timestamp columns (a slow operation). This offers three main advantages:

  • It permits filtering out some rows of the CSV, making it smaller
  • It permits specifying which timestamp columns (if there are several) are useful to convert (since only some of them may be needed in the operations)
  • It permits specifying a format for the timestamp column, speeding up the conversion 5x – 10x

This can be done using the following piece of code:

from pm4py.objects.log.adapters.pandas import csv_import_adapter

df = csv_import_adapter.import_dataframe_from_path_wo_timeconversion("C:\\receipt.csv", sep=',', quotechar=None)

If the CSV file is huge, it is advisable to load only a smaller subset of rows (for example, 1000) through the nrows parameter.
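Since the loading step is built on pandas, the effect of nrows can be illustrated with a plain read_csv call (a sketch on an in-memory CSV, assuming nrows is forwarded unchanged to pandas):

```python
import io
import pandas as pd

csv_data = io.StringIO(
    'case:concept:name,concept:name,time:timestamp\n'
    '1,register request,2011-01-01 10:00:00\n'
    '1,examine file,2011-01-01 11:00:00\n'
    '2,register request,2011-01-02 09:00:00\n'
)

# nrows limits how many data rows are parsed; useful to preview a huge log
df_preview = pd.read_csv(csv_data, nrows=2)

print(len(df_preview))
```

This is handy to inspect the column names and attribute values before committing to a full load.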

After loading, it is always useful to store the correspondence between the main attributes and the columns of the CSV. For example, to specify case:concept:name as the column hosting the Case ID, concept:name as the column hosting the activity, and time:timestamp as the column hosting the timestamp, the following code could be used:

from pm4py.util import constants
CASEID_GLUE = "case:concept:name"
ACTIVITY_KEY = "concept:name"
TIMEST_KEY = "time:timestamp"
parameters = {constants.PARAMETER_CONSTANT_CASEID_KEY: CASEID_GLUE,
              constants.PARAMETER_CONSTANT_ACTIVITY_KEY: ACTIVITY_KEY,
              constants.PARAMETER_CONSTANT_TIMESTAMP_KEY: TIMEST_KEY}

Fast preliminary filtering

A log extracted from an Information System may contain hundreds, if not thousands, of different activities, some with a very low number of occurrences. In a process model, it is only useful to display a small subset of them (for example, 20). Hence, it may be useful to keep only the event records related to the most frequent activities. This can be done with the following code:

from pm4py.algo.filtering.pandas.attributes import attributes_filter

df = attributes_filter.filter_df_keeping_spno_activities(df, activity_key=ACTIVITY_KEY, max_no_activities=10)
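For intuition, the effect of this filter can be reproduced with plain pandas: count the occurrences of each activity, keep the most frequent ones, and select the matching rows. A hedged sketch of the idea (the actual PM4Py filter may differ in details such as tie-breaking):

```python
import pandas as pd

ACTIVITY_KEY = "concept:name"

# A toy dataframe: activity A occurs 3 times, B twice, C once
df = pd.DataFrame({
    "case:concept:name": [1, 1, 1, 2, 2, 3],
    ACTIVITY_KEY: ["A", "B", "A", "B", "A", "C"],
})

max_no_activities = 2
# value_counts is sorted in descending order, so head(k) yields the k most frequent activities
top_activities = df[ACTIVITY_KEY].value_counts().head(max_no_activities).index
filtered_df = df[df[ACTIVITY_KEY].isin(top_activities)]

print(sorted(filtered_df[ACTIVITY_KEY].unique()))
```

Note that entire rows are kept or dropped; cases that only contain rare activities shrink accordingly.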

Other types of filtering, that can be applied to the Pandas dataframe, are described in the Filtering section.

Conversion of the timestamp columns

After the import and the filtering have been done, it is important to convert the timestamp columns from strings to a datetime format before performing Process Mining analysis. This is needed, for example, to measure the performance between activities.

Moreover, it is possible to specify a timestamp format to speed up the conversion by 5x-10x.

The following code helps to specify the timestamp columns and the format:

TIMEST_COLUMNS = ["time:timestamp"]
TIMEST_FORMAT = "%Y-%m-%d %H:%M:%S"

Then, the conversion could be done using the following code:

df = csv_import_adapter.convert_timestamp_columns_in_df(df, timest_columns=TIMEST_COLUMNS,
                                                        timest_format=TIMEST_FORMAT)
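Under the hood, this kind of conversion is a pandas to_datetime call; a minimal sketch showing the effect of passing an explicit format string (which avoids per-row format inference and is where the 5x-10x speedup comes from):

```python
import pandas as pd

df = pd.DataFrame({"time:timestamp": ["2011-01-01 10:00:00",
                                      "2011-01-01 11:30:00"]})

TIMEST_FORMAT = "%Y-%m-%d %H:%M:%S"
# With an explicit format, pandas skips inferring the format for every row
df["time:timestamp"] = pd.to_datetime(df["time:timestamp"], format=TIMEST_FORMAT)

print(df["time:timestamp"].dtype)
```

After the conversion, differences between timestamps yield timedeltas, which is what performance measurements rely on.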

Discovering the number of occurrences of the activities and the Directly-Follows Graph from the Pandas dataframe

The following code helps to retrieve the number of occurrences of the activities in the Pandas dataframe:

from pm4py.algo.filtering.pandas.attributes import attributes_filter

activities_count = attributes_filter.get_attribute_values(df, attribute_key=ACTIVITY_KEY)
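The result is essentially a mapping from each activity to its number of occurrences; with plain pandas, the same information can be obtained through value_counts (a sketch of the concept, not of the PM4Py internals):

```python
import pandas as pd

ACTIVITY_KEY = "concept:name"
df = pd.DataFrame({ACTIVITY_KEY: ["A", "B", "A", "C", "A"]})

# Map each activity to its number of occurrences in the dataframe
activities_count = df[ACTIVITY_KEY].value_counts().to_dict()

print(activities_count)
```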

The following code helps to retrieve both the frequency and the performance DFG (Directly-Follows Graph), using the median as the aggregation metric for the performance:

from pm4py.algo.discovery.dfg.adapters.pandas import df_statistics

[dfg_frequency, dfg_performance] = df_statistics.get_dfg_graph(df, measure="both",
                                                               perf_aggregation_key="median",
                                                               case_id_glue=CASEID_GLUE, activity_key=ACTIVITY_KEY,
                                                               timestamp_key=TIMEST_KEY)
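Conceptually, the frequency DFG counts, within each case, how often one activity is directly followed by another. A self-contained pandas sketch of this idea (simplified with respect to the PM4Py implementation, which also handles sorting by timestamp and the performance aggregation):

```python
import pandas as pd

CASEID_GLUE = "case:concept:name"
ACTIVITY_KEY = "concept:name"

# Two cases: case 1 follows A -> B -> C, case 2 follows A -> C
df = pd.DataFrame({
    CASEID_GLUE: [1, 1, 1, 2, 2],
    ACTIVITY_KEY: ["A", "B", "C", "A", "C"],
})

# Pair each event with the next event of the same case
df["next_activity"] = df.groupby(CASEID_GLUE)[ACTIVITY_KEY].shift(-1)
pairs = df.dropna(subset=["next_activity"])

# Count how often each directly-follows pair occurs
dfg_frequency = pairs.groupby([ACTIVITY_KEY, "next_activity"]).size().to_dict()

print(dfg_frequency)
```

The performance DFG follows the same pairing, but aggregates the time differences between the paired events instead of counting them.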

If the number of elements of the Directly-Follows graph is too high, the following instructions help to remove the elements that fall below a noise threshold:

from pm4py.algo.filtering.dfg import dfg_filtering

dfg_frequency = dfg_filtering.apply(dfg_frequency, {"noiseThreshold": 0.03})
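The idea behind such a noise filter can be sketched in a few lines: drop the arcs whose frequency is small relative to the most frequent arc (the actual PM4Py implementation may use a different criterion):

```python
# A DFG as a dictionary from (source, target) arcs to frequencies
dfg_frequency = {("A", "B"): 100, ("B", "C"): 95, ("A", "C"): 2}

noise_threshold = 0.03
max_freq = max(dfg_frequency.values())

# Keep only the arcs whose frequency is at least noise_threshold * max_freq
filtered_dfg = {arc: freq for arc, freq in dfg_frequency.items()
                if freq >= noise_threshold * max_freq}

print(filtered_dfg)
```

With the 0.03 threshold, the rare (A, C) arc is removed while the two dominant arcs survive.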

From the Directly-Follows graph, the IMDFa algorithm can be applied to discover a Petri net:

from pm4py.algo.discovery.inductive import factory as inductive_miner

net, initial_marking, final_marking = inductive_miner.apply_dfg(dfg_frequency)

Discovering aggregated statistics on the arcs of the Petri net without the need for a replay

In PM4Py, a greedy method is provided to annotate the Petri net with performance information. The shortest paths between the visible transitions of the Petri net are discovered (through BFS) using the following code:

from pm4py.visualization.petrinet.util.vis_trans_shortest_paths import get_shortest_paths

spaths = get_shortest_paths(net)

Then, the aggregated statistics regarding the Petri net could be retrieved using the following code:

from pm4py.visualization.petrinet.util.vis_trans_shortest_paths import get_decorations_from_dfg_spaths_acticount

aggregated_statistics = get_decorations_from_dfg_spaths_acticount(net, dfg_performance,
                                                                  spaths,
                                                                  activities_count,
                                                                  variant="performance")

Then, a representation of the decorated Petri net could be obtained:

from pm4py.visualization.petrinet import factory as pn_vis_factory

gviz = pn_vis_factory.apply(net, initial_marking, final_marking, variant="performance",
                            aggregated_statistics=aggregated_statistics, parameters={"format": "svg"})
