Process Mining
Suppose we have a log of our process, but do not have the model of this process
- the process mining is a way to find a process from the logs
- the model can be a Workflow Net, YAWL or BPMN model.
Definitions
A process mining algorithm is a function that
- maps an event log $L$ to some workflow model
- ideally the model should be sound
play-in = process discovery
- event logs $\to$ workflow model
-
the ability to rediscover
- is a property of a process mining algorithm to discover a model of some process
- from logs that have been generated from this model
- i.e. in case of Petri Nets, if $N$ is the original model and $N'$ is the discovered model, then $N \equiv N'$
Some Notes
Usually, first a Petri Net model is discovered
- and then this model is converted to YAWL or BPMN
- because Petri Nets are simpler
Measures
There are four conflicting criteria
- Fitness
- the discovered network should allow the behavior seen in the logs
- Precision
- the discovered network should not allow the behavior not seen in the logs
- too precise $\to$ bad generalization
- Generalization
- the discovered model should generalize the behavior seen in the logs
- Simplicity
- it should be as simple as possible
- too simple $\to$ low fitness
- The main challenge of Process Mining is that all these criteria are conflicting:
- It's really hard to simultaneously satisfy all of them
- this makes Process Mining to be a Multi-Objective Optimization problem
Fitness
Can we replay the log?
- we try to replay each sequences from the logs
- so for each we see if this sequences is allowed
- if each sequence can be replayed - 100% fitness
- special notion for fitness is also defined for the Genetic Process Miner
Precision
Do we underfit the log?
- play out the model, capture logs
- see what's generated and compare to the original logs
- is it far? (need to have some distance measure)
if produced logs $\subseteq$ original logs
- then we have 100% precision
Generalization
Do we overfit the log?
Simplicity
The simpler the model - the better
- may add some penalty for each added place
- (the more places we add, the more complex the model becomes)
Examples
The Flower Model
- this model in at the one end of spectrum
- very simple
- good fitness: every possible trace can be reproduced
- generalization: allows a lot of behavior
- but precision is very low: it allows a lot of behavior not seen in the logs
The Enumerating Model
- another end of the spectrum
- behavior only seen in the logs: not more, not less
- good fitness and good precision
- no generalization at all
- no simplicity
Conclusions
So Process Mining is difficult
- it's a Multi-Objective Optimization problem
- there are no negative examples
- the search space is too complex
- logs typically show only a fraction of possible behavior
Algorithms
These algorithms let you find a Petri Net from logs
Sources