Apply Machine Learning Techniques in Data Quality Management










- Slides: 10
Apply Machine Learning Techniques in Data Quality Management Processes Jennifer Yang September 2019 1 Strata Data Conference - New York City
Contents Ø Opportunities and Hypothesis Ø Data Quality Management Process Ø Validation Rules Tested Ø Results Ø Benchmark Ø Supervised Learning Ø Unsupervised Learning Ø Lessons Learned Ø Recommendations 2 Strata Data Conference - New York City
Opportunity and Hypothesis Ø Challenge / Opportunity Ø Large amount of data from different sources without metadata and lineage information requires data quality control Ø Conventional rule based data quality management methodology requires significant number of data subject matter experts to define, implement, and maintain the data validation rules Ø It is costly, poorly scalable, and slow time-to-market Ø Hypothesis Machine learning could be explored to automatically propose the validation rules, learn the exceptions vs. non exceptions without needing to employ large number of business and technical experts to define and implement the rules 3 Strata Data Conference - New York City
Data Quality Management Ø Conventional Data Quality Management Processes Ø Data in motion data validations check the data integrity while data is being transmitted from the source to target. These validations usually can be automated via system processes Ø Data at rest data validations Ø Business data validation requires subject matter experts to define the validation rules Ø technical data validation can be automated via system processes Ø Technology team implements the data validation rules Ø Data validation results / reports are generated based on the data quality rules defined Ø Data analysts monitor and remediate the identified data issues Ø Application of Machine Learning Techniques in Data Quality Management Ø Target areas - business data validations rules definition Ø Output of the supervised learning model – data anomalies Ø Output of the unsupervised learning models – proposed validation rules and the data anomalies associated with the proposed rules Ø Increase explainability Ø Increase transparency and acceptance of the new methodology Ø Facilitate the data issue investigation 4 Strata Data Conference - New York City
Data, Validation Rules, and Results Data Used for POC: 1. 1 million rows, 87 columns Validation Rules Tested: Num Rule Category 1 Dependencies 2 Calculations 3 Technical Validation 4 Referential Integrity Rule Description If tradeside =BUY , value in column “position” should be positive & if tradeside =SELL , Position. Side. Consistent value in column “position” should be negative default sales credit + additional sales credit + sales markup = sales credit. sum up the first three attributes and compare with the sales credit. If they are not equal, then the Sales. Credits record would fail the validation. Rule Name Cusip. Is. Not. Null For each tradenum, value in column “cusip” cannot be NULL the product id shall be a valid product id. Validate the product id in the data feed against the Sec Master when cusip is provided in the feed. If the product Id in the feed Cusip. Has. Valid. Product. Id is not the same as the product id in Sec Master for the same cusip, the record should fail this rule. Results: Num Category Benchmark(Current Methodology) Passed Failed Total Supervised Learning Model Passed % Passed Failed Total Unsupervised learning Model Passed % Passed Failed Total Passed% 1 Dependencies 925, 392 214, 648 1, 140, 040 81 925, 378 214, 662 1, 140, 040 81 925, 392 214, 648 1, 140, 040 81 2 Calculations 1, 026, 976 113, 064 1, 140, 040 90 1, 026, 979 113, 061 1, 140, 040 90 1, 026, 976 113, 064 1, 140, 040 90 1, 139, 908 1, 140, 040 100 1, 129, 824 10, 216 1, 140, 040 99 Technical Validation Referential Integrity 3 4 5 132 132 Strata Data Conference - New York City
Sample Rules Proposed by Unsupervised Learning Model The following are some examples of the proposed validation rules by the unsupervised machine learning models: Rule Proposed Percentage Category sum(or_and_is_value_tradeside_B_is_position_greater_0_and_is_value_tradeside_S_is_position_less_0) 81% Matched to Rule 1 sum(or_is_value_tradeside_B_and_is_position_greater_0_is_position_less_0) 78% Similar to Rule 1 sum(or_is_value_tradeside_X_is_position_greater_0) 78% Similar to Rule 1 sum(or_is_position_greater_0_and_is_position_greater_0_is_position_less_0) 78% Similar to Rule 1 The sum of additionalsalescredit, defaultsalescredit, salesmarkup must be equal to salescredit 90% Matched to Rule 2 The sum of commissions, discountmargin must be equal to additionalsalescredit 90% Similar to Rule 2 The sum of commissions, discount, fees must be equal to additionalsalescredit 89% Similar to Rule 2 The sum of commissions, discount, salesmarkup must be equal to additionalsalescredit 89% Similar to Rule 2 The sum of commissions, discountmargin, fees must be equal to additionalsalescredit 95% Similar to Rule 2 The sum of commissions, discountmargin, defaultsalescredit must be equal to additionalsalescredit 84% Similar to Rule 2 cusip is not null 100% Matched to Rule 3 bz_date is not null 100% Similar to Rule 3 price is not null 100% Similar to Rule 3 principal is not null 100% Similar to Rule 3 productdescription is not null 100% Similar to Rule 3 productid is not null 100% Similar to Rule 3 productsubtype is not null 100% Similar to Rule 3 producttype is not null 100% Similar to Rule 3 amount is not null 100% Similar to Rule 3 productid must be equal to productid_lookup 99% Matched to Rule 4 productid_lookup must be equal to productid 99% Similar to Rule 4 6 City Strata Data Conference - New York
Lessons Learned Ø The machines learning models could be used for both business and technical validation rules generation and data anomalies detection Ø While supervised learning model matched the rule based validation results almost 100%, it requires labeled datasets which defeats the purpose of cost saving and scalability Ø Unsupervised learning model shows great potential. But the number of proposed rules needs to be reduced and the run time performance needs to be improved in order to operationalize the models Ø Data preparation takes lots of time 7 Strata Data Conference - New York City
Recommendations Ø Develop semi-supervised learning models to suggest data validation rules and capture data anomalies Ø Leverage existing data validation rules, metadata, lineage, and profile of the data, if available Ø Create Machine – Human feedback loops Ø Have data analysts review and select the rules from the set of proposed validation rules that are valid for the data Ø Data analysts investigate and resolve the suggested data exception from the selected rules Ø Verified rules and data exceptions by data analysts can be considered by the machine learning models for future runs to improve model performance 8 Strata Data Conference - New York City
Suggested Workflows Data Storage SOR 1 … SOR 3 SOR 2 Machine Learning Models Generate Rules Proposed Rules 9 City • Validation Rule 1 • Validation Rule 2 • … Select Rules Selected Rules • Validation Rule 1 • Validation Rule 2 Investigate / Remediate Anomalies Data Anomalies • Validation Rule 1 • Data Record 2 • Validation Rule 2 • Data Record 3 • Data Record 4 Strata Data Conference - New York
Questions ? 10