Data Mining and machine learning DM Lecture 1

  • Slides: 26
Download presentation
Data Mining (and machine learning) DM Lecture 1: Overview of DM, and overview of

Data Mining (and machine learning) DM Lecture 1: Overview of DM, and overview of the DM part of the DM&ML module Many of these slides are highly derivative of Nick Taylor’s slides used for this module in previous years

Overview of My Lectures All at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html •

Overview of My Lectures All at: http: //www. macs. hw. ac. uk/~dwcorne/Teaching/dmml. html • 25/9 Overview of DM (and of these 8 lectures) • 02/10: Data Cleaning - usually a necessary first step for large amounts of data • 09/10 Basic Statistics for Data Miners - essential knowledge, and very useful • 16/10 Basket Data/Association Rules (A Priori algorithm) - a classic algorithm, used much in industry • NO THURSDAY LECTURE OCTOBER 23 rd • 30/10 Cluster Analysis and Clustering - simple algs that tell you much about the data • NO THURSDAY LECTURE November 6 th • 13/11: Similarity and Correlation Measures - making sure you do clustering appropriately for the given data • 20/11: Regression - the simplest algorithm for predicting data/class values • 27/11: A Tour of Other Methods and their Essential Details - every important method you may learn about in future

Data Mining - Definition & Goal Definition • – Data Mining is the exploration

Data Mining - Definition & Goal Definition • – Data Mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules Goal • – To permit some other goal to be achieved or performance to be improved through a better understanding of the data

Some examples of huge databases Retail basket data: much commercial DM is done with

Some examples of huge databases Retail basket data: much commercial DM is done with this. In one store, 18, 000 baskets per month Tesco has >500 stores. Per year, 100, 000 baskets ? The Internet ~ >15, 000, 000 pages Lots of datasets: UCI Machine Learning repository How can we begin to understand exploit such datasets? Especially the big ones?

Like this …

Like this …

and this …

and this …

and this …

and this …

or this … (see http: //www. cs. umd. edu/hcil/treemap-history/

or this … (see http: //www. cs. umd. edu/hcil/treemap-history/

or this … • see http: //websom. hut. fi/websom/millio ndemo/html/root. html

or this … • see http: //websom. hut. fi/websom/millio ndemo/html/root. html

Data Mining - Basics • Data Mining is the process of discovering patterns and

Data Mining - Basics • Data Mining is the process of discovering patterns and inferring associations in raw data • Data Mining is a collection of powerful techniques intended to analyse large amounts of data • There is no single Data Mining approach • Data Mining can employ a range of techniques, either individually or in combination with each other

Data Mining – Why is it important? • • • Data are being generated

Data Mining – Why is it important? • • • Data are being generated in enormous quantities Data are being collected over long periods of time Data are being kept for long periods of time Computing power is formidable and cheap A variety of Data Mining software is available

Data Mining – History • The approach has its roots over 40 years ago

Data Mining – History • The approach has its roots over 40 years ago • In the early 1960 s Data Mining was called statistical analysis, and the pioneers were statistical software companies such as SPSS • By the late 1980 s these traditional techniques had been augmented by new methods such as machine induction, artificial neural networks, evolutionary computing, etc.

Data Mining – Two Major Types • Directed (Farming) – Attempts to explain or

Data Mining – Two Major Types • Directed (Farming) – Attempts to explain or categorise some particular target field such as income, medical disorder, genetic characteristic, etc. • Undirected (Exploring) – Attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes • Compare with Supervised and Unsupervised systems in machine learning

Data Mining – Tasks Classification - Example: high risk for cancer or not Estimation

Data Mining – Tasks Classification - Example: high risk for cancer or not Estimation - Example: household income Prediction - Example: credit card balance transfer average amount Affinity Grouping - Example: people who buy X, often also buy Y with a probability of Z Clustering - similar to classification but no predefined classes Description and Profiling – Identifying characteristics which explain behaviour - Example: “More men watch football on TV than women”

Data Warehousing • Note that Data Mining is very generic and can be used

Data Warehousing • Note that Data Mining is very generic and can be used for detecting patterns in almost any data – Retail data – Genomes – Climate data – Etc. • Data Warehousing, on the other hand, is almost exclusively used to describe the storage of data in the commercial sector

Data Warehousing - Definitions “A subject-oriented, integrated, time-variant and nonvolatile collection of data in

Data Warehousing - Definitions “A subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process” W. H. Inmon, "What is a Data Warehouse? " Prism Tech Topic, Vol. 1, No. 1, 1995 -- a very influential definition. “A copy of transaction data, specifically structured for query and analysis” Ralph Kimball, from his 2000 book, “The Data Warehouse Toolkit”

Data Warehouse – why? For organisational learning to take place data from many sources

Data Warehouse – why? For organisational learning to take place data from many sources must be gathered together over time and organised in a consistent and useful way Data Warehousing allows an organisation to remember its data and what it has learned about its data Data Mining techniques make use of the data in a Data Warehouse and subsequently add their results to it

Data Warehouse - Contents • A Data Warehouse is a copy of transaction data

Data Warehouse - Contents • A Data Warehouse is a copy of transaction data specifically structured for querying, analysis and reporting • The data will normally have been transformed when it was copied into the Data Warehouse • The contents of a Data Warehouse, once acquired, are fixed and cannot be updated or changed later by the transaction system - but they can be added to of course

Data Marts • A Data Mart is a smaller, more focused Data Warehouse –

Data Marts • A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse • A Data Mart will normally reflect the business rules of a specific business unit within an enterprise – identifying data relevant to that unit’s acitivities

From Data Warhousing to Machine Learning, via Data Marts

From Data Warhousing to Machine Learning, via Data Marts

The Big Challenge for Data Mining • The largest challenge that a Data Miner

The Big Challenge for Data Mining • The largest challenge that a Data Miner may face is the sheer volume of data in the Data Warehouse • It is very important, then, that summary data also be available to get the analysis started • The sheer volume of data may mask the important relationships in which the Data Miner is interested • Being able to overcome the volume and interpret the data is essential to successful Data Mining

What happens in practice … Data Miners, both “farmers” and “explorers”, are expected to

What happens in practice … Data Miners, both “farmers” and “explorers”, are expected to utilise Data Warehouses to give guidance and answer a limitless variety of questions The value of a Data Warehouse and Data Mining lies in a new and changed appreciation of the meaning of the data There are limitations though - A Data Warehouse cannot correct problems with its data, although it may help to more clearly identify them

Which brings us to “data cleaning”, next week …

Which brings us to “data cleaning”, next week …