Merging Event Logs with Many to Many Relationships

Lihi Raichelson and Pnina Soffer
Eindhoven, September 2014

Motivation and aim
• Process mining requires a single event log, while many processes include procedures that use different systems and thus have separate logs
  o The different procedures might be called as multiple instance parts
• To provide a full analysis, process mining should be applied to a log containing all relevant activities of the end-to-end process flow
• We present an approach for merging event logs which
  o is capable of dealing with all kinds of relationships between logs (one-to-one, one-to-many, many-to-one and many-to-many)
  o does not assume a shared case ID
  o is capable of dealing with logs with unstructured free text

Example
A simplified log of an ordering system (main process)
• Different organizational units can directly place orders for office supplies

Order | Timestamp | User | Activity | Item no | Department
3001 | 02/02/14 10:12 | Ilana | open order | 1234 1235 | dep. 89
3001 | 02/02/14 10:13 | Tsvi | approve order | 1234 1235 | dep. 89
3001 | 02/02/14 13:16 | Ilana | check status | 1234 1235 | dep. 89
3001 | 03/02/14 16:18 | Ilana | receive item | 1234 | dep. 89
3001 | 04/02/14 16:35 | Ilana | receive item | 1235 | dep. 89
3001 | 04/02/14 16:36 | Ilana | close order | 1234 1235 | dep. 89
3002 | 02/02/14 10:30 | Sigal | open order | 1234 1236 | dep. 79
3002 | 02/02/14 10:31 | Rachel | approve order | 1234 1236 | dep. 79
3002 | 02/02/14 15:31 | Sigal | check status | 1234 1236 | dep. 79
3002 | 03/02/14 16:19 | Sigal | receive item | 1234 | dep. 79
3002 | 04/02/14 16:35 | Sigal | receive item | 1236 | dep. 79
3002 | 04/02/14 16:36 | Sigal | close order | 1234 1236 | dep. 79

Example
A simplified log of consolidated deliveries (sub process)
• Consolidated deliveries are received at a warehouse, where they are registered and distributed to the ordering units

Delivery | Timestamp | User | Activity | Item no | Department
5001 | 03/02/14 11:45 | Mosh | receive item | 1234 |
5001 | 03/02/14 16:15 | Mosh | send to dep. | 1234 | dep. 89
5001 | 03/02/14 16:16 | Mosh | send to dep. | 1234 | dep. 79
5002 | 04/02/14 12:46 | Mosh | receive item | 1235 1236 |
5002 | 04/02/14 16:30 | Mosh | send to dep. | 1235 | dep. 89
5002 | 04/02/14 16:31 | Mosh | send to dep. | 1236 | dep. 79

Example
• Both process parts employ multiple instance procedures
• The lower-level instances in both process parts refer to ordered items
• Grouping of items to cases is different for the main and sub process
• In the main process, the grouping is by order (serving as case ID)
• In the sub process, the grouping is by delivery (case ID), where items ordered by different departments are grouped together

Merging Event Logs
• We assume two logs, one of a main process and one of a sub process
• In the absence of a common case ID, the main challenge is to relate cases from the two logs to the same overall case
• To handle many-to-many relationships between the cases of the different logs, we duplicate cases of the sub process and merge them with every relevant case in the main process, revealing the end-to-end process flow
• Merging the logs would enable mining the end-to-end process, and particularly tracing the low-level instances (ordered items)

Merging Event Logs
(a) The main log includes cases which may have multiple instances, grouped together in the sub process
  • Instances of case 1 of the main process are included in case 1 and case 2 of the sub process
(b) New case IDs for the merged log
  • Each case of the main process corresponds to a unique new case ID
  • Cases of the sub process can be duplicated (e.g. case 1 of the sub process)

Basic assumptions
We seek matching cases in the two logs under the following assumptions:
• both logs are synchronized, with reliable and comparable timestamps
• both logs might include multiple instances
• attribute values in the logs use uniform terms
• attribute values may include free text
• the sub process log might include cases initiated by other processes
For logs whose case IDs are not identical, cases are matched based on similarity of attribute values

Similarity of attribute values
• Straightforward if distinct attributes are related as foreign keys
• If a direct relation cannot be established (particularly with free text attributes):
• Calculate content similarity, using the Term Frequency-Inverse Document Frequency (tf-idf) technique
  o For each case, all words are extracted to a "bag of words"
  o Words that are common to all cases (stop words, activity names) are eliminated
  o Matching cases would have more common words in their attribute values than non-matching cases
In the example:
• main log case 3001, bag of unique words: {3001, Ilana, Tsvi, 1234, 1235, 89}
• sub log case 5001, bag of unique words: {5001, Mosh, 1234, 89, 79}
Similarity score = 2
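
A minimal sketch of the bag-of-words comparison for example cases 3001 and 5001, in Python. It only counts the unique words two cases share; it does not apply the tf-idf weighting named on the slide. The function names, the event layout (user, item numbers and department only) and the COMMON_WORDS set are assumptions made for illustration, not part of the presented algorithm.

```python
# Words shared by every case carry no signal and are dropped.
# Assumption: in this toy data only the "dep" prefix qualifies.
COMMON_WORDS = {"dep"}


def bag_of_words(case_id, events):
    """Collect the unique words of a case from its ID and attribute values."""
    words = {case_id}
    for event in events:
        for value in event.values():
            words.update(value.replace(".", " ").split())
    return words - COMMON_WORDS


def similarity(bag_a, bag_b):
    """Score two cases by the number of unique words they share."""
    return len(bag_a & bag_b)


# Attribute values taken from the ordering log (case 3001).
main_3001 = [
    {"user": "Ilana", "items": "1234 1235", "department": "dep. 89"},
    {"user": "Tsvi", "items": "1234 1235", "department": "dep. 89"},
    {"user": "Ilana", "items": "1234", "department": "dep. 89"},
    {"user": "Ilana", "items": "1235", "department": "dep. 89"},
]
# Attribute values taken from the delivery log (case 5001).
sub_5001 = [
    {"user": "Mosh", "items": "1234", "department": ""},
    {"user": "Mosh", "items": "1234", "department": "dep. 89"},
    {"user": "Mosh", "items": "1234", "department": "dep. 79"},
]

bag_main = bag_of_words("3001", main_3001)
bag_sub = bag_of_words("5001", sub_5001)
score = similarity(bag_main, bag_sub)  # shared words: 1234 and 89
print(score)
```

A plain overlap count treats every word equally; a tf-idf weighting would instead give rare words (such as an item number) more weight than frequent ones.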

Appropriate temporal relations
• The sub process is triggered by the main process
  o Should start after the beginning of the main process
• The sub process needs to provide some feedback to the main process
  o Should have time overlap with the main process and cannot start after the main process has ended
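
The two temporal requirements can be sketched as a single interval check in Python. This is an illustration under assumptions: the function name is hypothetical, timestamps are read as day/month/year, and the boundary handling (strictly after the main start, no later than the main end) is one possible reading of the requirements. The first sub process start is taken from delivery 5001; the other two are made-up values that show rejected cases.

```python
from datetime import datetime

FMT = "%d/%m/%y %H:%M"  # assumption: timestamps are day/month/year


def parse(ts):
    """Parse a timestamp string such as '02/02/14 10:12'."""
    return datetime.strptime(ts, FMT)


def temporally_compatible(main_start, main_end, sub_start):
    """True if the sub process starts after the main process begins
    and no later than the main process ends."""
    return main_start < sub_start <= main_end


# Order 3001 runs from its first event to its last event.
main_start = parse("02/02/14 10:12")
main_end = parse("04/02/14 16:36")

# Delivery 5001 starts while order 3001 is still open.
ok = temporally_compatible(main_start, main_end, parse("03/02/14 11:45"))
# Hypothetical sub cases that start too early or too late.
too_early = temporally_compatible(main_start, main_end, parse("01/02/14 09:00"))
too_late = temporally_compatible(main_start, main_end, parse("05/02/14 09:00"))
print(ok, too_early, too_late)
```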

Allen’s interval algebra relations
[Figure: Allen’s interval algebra relations, with the label MATCH]

Algorithm
• Calculates match scores for every case combination that meets the temporal requirements
• Generates a new case ID for each case of the main process
• Selects the sub process cases whose match score is maximal and above zero
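
A rough Python sketch of the selection and merge steps, assuming the word bags have already been built and the candidate sub cases have already passed the temporal check. The function names, the tuple layout of events and the shortened event lists are assumptions for illustration; the new case ID "3001-A" follows the running example, and how such IDs are generated is not specified here.

```python
from datetime import datetime


def at(day, hour, minute):
    """Timestamp in February 2014, matching the example logs."""
    return datetime(2014, 2, day, hour, minute)


def select_matches(main_bag, candidate_bags):
    """Return the sub case IDs whose shared-word score is maximal and above zero."""
    scores = {sub_id: len(main_bag & bag) for sub_id, bag in candidate_bags.items()}
    best = max(scores.values(), default=0)
    if best <= 0:
        return []
    return sorted(sub_id for sub_id, s in scores.items() if s == best)


def merge_case(new_id, main_events, sub_events_by_id, matched_ids):
    """Tag main and matched sub events with new_id and order them by time."""
    merged = [(ts, new_id, "main log", act) for ts, act in main_events]
    for sub_id in matched_ids:
        merged += [(ts, new_id, "sub log", act) for ts, act in sub_events_by_id[sub_id]]
    return sorted(merged, key=lambda event: event[0])


bag_3001 = {"3001", "Ilana", "Tsvi", "1234", "1235", "89"}
candidates = {
    "5001": {"5001", "Mosh", "1234", "89", "79"},
    "5002": {"5002", "Mosh", "1235", "1236", "89", "79"},
}
matched = select_matches(bag_3001, candidates)  # both deliveries share 2 words

# Shortened event lists (timestamp, activity) from the example logs.
main_events = [
    (at(2, 10, 12), "open order"),
    (at(3, 16, 18), "receive item"),
    (at(4, 16, 36), "close order"),
]
sub_events = {
    "5001": [(at(3, 11, 45), "receive item"), (at(3, 16, 15), "send to dep.")],
    "5002": [(at(4, 12, 46), "receive item")],
}
merged = merge_case("3001-A", main_events, sub_events, matched)
for ts, case_id, ref_log, activity in merged:
    print(case_id, ref_log, ts, activity)
```

Because a sub case is copied into every main case it matches, the same delivery can appear under several new case IDs.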

Running example

New ID | Ref log | ID | Timestamp | User | Activity | Item no | Dep.
3001-A | main log | 3001 | 02/02/14 10:12 | Ilana | open order | 1234 1235 | dep. 89
3001-A | main log | 3001 | 02/02/14 10:13 | Tsvi | approve order | 1234 1235 | dep. 89
3001-A | main log | 3001 | 02/02/14 13:16 | Ilana | check status | 1234 1235 | dep. 89
3001-A | sub log | 5001 | 03/02/14 11:45 | Mosh | receive item | 1234 |
3001-A | sub log | 5001 | 03/02/14 16:15 | Mosh | send to dep. | 1234 | dep. 89
3001-A | sub log | 5001 | 03/02/14 16:16 | Mosh | send to dep. | 1234 | dep. 79
3001-A | main log | 3001 | 03/02/14 16:18 | Ilana | receive item | 1234 | dep. 89
3001-A | sub log | 5002 | 04/02/14 12:46 | Mosh | receive item | 1235 1236 |
3001-A | sub log | 5002 | 04/02/14 16:30 | Mosh | send to dep. | 1235 | dep. 89
3001-A | sub log | 5002 | 04/02/14 16:31 | Mosh | send to dep. | 1236 | dep. 79
3001-A | main log | 3001 | 04/02/14 16:35 | Ilana | receive item | 1235 | dep. 89
3001-A | main log | 3001 | 04/02/14 16:36 | Ilana | close order | | dep. 89

Evaluation
• Evaluated by a controlled experiment using synthetic logs:
  (a) the correct match between cases is known in advance, allowing an accurate measurement of precision and recall of the results
  (b) when generating logs, a full coverage of relationship types (one-to-one up to many-to-many) and temporal relations between the logs can be ensured
  (c) it is possible to control the amount of text-related noise (additional irrelevant text) in the attribute values

Evaluation
4 logs of a main process and corresponding sub processes were generated
• 130-260 cases each, with 3-7 events per case
• Basic attributes: case ID (order number / delivery number), timestamp, resource, activity
• Additional attributes: item number, department number
• Three possible temporal relations between main and sub process: (1) sub during main (2) main overlaps with sub (3) sub finishes main

Logs relationship | Main log, number of cases | Sub log, number of cases | Unified log, expected number of cases
One-to-One (OTO) | 130 | 130 |
One-to-Many (OTM) | 130 | 260 |
Many-to-One (MTO), Many-to-Many: 260 130 260

Evaluation results
• All log combinations resulted in a perfect unified log with 100% recall and precision
• To evaluate the ability of the algorithm to handle noisy free text, we
  (1) added to the main logs a free text attribute of up to 200 words
  (2) added to the sub logs three free text attributes with up to five words each
Note: the values of the free text were randomly selected, and served as noise which should not affect the matching

Logs relationship | True positives (correctly identified) | False positives (incorrectly identified) | False negatives (incorrectly rejected) | Recall | Precision | F-measure
OTO | 126 | 16 | 4 | 97% | 89% | 93%
OTM | 244 | 28 | 16 | 94% | 90% | 92%
MTO | 239 | 11 | 10 | 96% | 96% | 96%
MTM | 501 | 35 | 15 | 97% | 93% | 95%
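
The recall, precision and F-measure percentages follow from the true positive, false positive and false negative counts by the standard definitions. A short Python check, with the function and variable names chosen here for illustration:

```python
def metrics(tp, fp, fn):
    """Return (recall, precision, F-measure) from match counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# (true positives, false positives, false negatives) per relationship type
counts = {
    "OTO": (126, 16, 4),
    "OTM": (244, 28, 16),
    "MTO": (239, 11, 10),
    "MTM": (501, 35, 15),
}

# Whole-number percentages in the order (recall, precision, F-measure)
percentages = {
    name: tuple(round(100 * value) for value in metrics(*c))
    for name, c in counts.items()
}
for name, values in percentages.items():
    print(name, values)  # e.g. OTO (97, 89, 93)
```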

Evaluation
The results indicate that the algorithm performs well:
• recall of at least 94% and precision of at least 89%
• recall was higher than precision in most cases
No substantial trade-off was observed between precision and recall:
• insignificant differences of F-measure values (92%-96%)
The performance of the algorithm is not sensitive to the type of relationship between the logs:
• no relationship type was identified as an "easier match" with superior performance over the others

Summary
We propose an algorithm which
• produces a unified log for all relationship types
• works where the logs have non-matching case IDs
• handles logs that contain unstructured and free-text data
A unified log provides:
• good support for overall flow analysis
• process improvement opportunities through an end-to-end rather than local view
• possible tracking of flow deficiencies and gaps between different parts of the process
Limitations:
• Duplicating cases supports flow analysis but not volume-oriented analysis

Future research
i. Additional evaluation: focus on scalability and performance, using real-life logs
ii. Using the merged logs for end-to-end process discovery might require some specialized visual representation
iii. Improving match results through interaction with the user, who can evaluate the matching of specific cases based on domain knowledge

THANK YOU Questions?

Real-life log example

Case ID | Date & Time | Activity | Resource | Free Text
184065 | 7/25/2012 15:11 | Opened ticket | aerez 3 | hey the stdudents got their unix permissions. need to open the following groups for them: mpgall mpgdsgn gsrall sklall gsr_rtl skl_rtl cpu 1270 cpu 1272_upf for users: aluz hsreter thanks amit
184065 | 7/25/2012 15:11 | Assigned to | tsela | hey the stdudents got their unix permissions. need to open the following groups for them: mpgall mpgdsgn gsrall sklall gsr_rtl skl_rtl cpu 1270 cpu 1272_upf for users: aluz hsreter thanks amit
184065 | 7/26/2012 15:11 | Closed ticket | tsela | Hi Added both to mpgall mpgdsgn gsrall sklall cpu 1270_upf You will see the groups in about 30-60 minutes you can view your groups using the 'groups $USER' command. You will have to open new sessions in order for these groups to be active in those sessions. To check if a session has those groups use the command 'groups'. For cpu 1270 please file a request here: https://hsdhsw.intel.com/HSD/HASWELL/default.aspx#access_request/default.aspx In order to get cpu 1270 select ip_domain ? 'process' and ip_access_method 'unix' Under job_role_template select 'PDE: sch_lay access For gsr_rtl skl_rtl please open a ticket to 'i. MPV (AVL)'. Thanks! Tal Sela (Database MMG) (04) 865 6596 hey the stdudents got their unix permissions. need to open the following groups for them: mpgall mpgdsgn gsrall sklall gsr_rtl skl_rtl cpu 1270 cpu 1272_upf for users: aluz hsreter thanks amit

Real-life processes
• Sub process log for permission approval, contains 8,008 instances

Case ID | Date & Time | Activity | Resource | User | Permission | Approved
8a08b7fa-7731-4b2c-8a8889cdb744308a | 7/25/2012 5:12 | open request | tsela | aluz | ec amr unix sklall | TRUE
8a08b7fa-7731-4b2c-8a8889cdb744308a | 7/25/2012 5:14 | COMPLETED | tsela | aluz | ec amr unix sklall | TRUE
8a08b7fa-7731-4b2c-8a8889cdb744308a | 7/25/2012 5:15 | send mail | tsela | aluz | ec amr unix sklall | TRUE
79dd3fef-1300-401a-8a0a-6d36ae751ee1 | 7/25/2012 5:16 | open request | tsela | hsreter | ec amr unix sklall | TRUE
79dd3fef-1300-401a-8a0a-6d36ae751ee1 | 7/25/2012 5:17 | COMPLETED | tsela | hsreter | ec amr unix sklall | TRUE
79dd3fef-1300-401a-8a0a-6d36ae751ee1 | 7/25/2012 5:18 | send mail | tsela | hsreter | ec amr unix sklall | TRUE