Data Mining UNIT-III (BIA)

What Is Data Mining? • Data mining – Extraction of interesting (previously unknown and potentially useful) patterns or knowledge from huge amounts of data • Alternative names – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Data Mining • Data mining is the process of extracting information from a company's various databases and reorganizing it for purposes other than those the databases were originally intended for. • It provides a means of extracting previously unknown, predictive information from the accessible data in data warehouses. • The data mining process differs from one organization to another, depending on the nature of the data and of the organization. • Data mining tools use sophisticated, automated algorithms to discover hidden patterns, correlations, and relationships among organizational data. • Data mining tools are used to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. • For example, in targeted marketing, data mining can use data on past promotional mailings to identify the targets most likely to maximize the return on the company’s investment in future mailings.

Functions • Classification: Infers the defining characteristics of a certain group • Clustering: Identifies groups of items that share a particular characteristic • Association: Identifies relationships between events that occur at one time • Sequencing: Similar to association, except that the relationship exists over a period of time • Forecasting: Estimates future values based on patterns within large sets of data

Why Not Traditional Data Analysis? • Tremendous amount of data • High dimensionality of data • High complexity of data • Different types of data: – Time-series data, sequence data – Structured data, graphs, social networks and multi-linked data – Heterogeneous databases and legacy databases – Spatial, spatiotemporal, multimedia, text and Web data – Software programs, scientific simulations, etc.

Characteristics • Data mining tools are needed to extract the buried information. • The “miner” is often an end user, empowered by “data drills” and other power query tools to ask ad hoc questions and get answers quickly, with little or no programming skill. • The data mining environment usually has a client/server architecture. • Because of the large amounts of data, it is sometimes necessary to use parallel processing for data mining. • Data mining tools are easily combined with spreadsheets and other end-user software development tools, enabling the mined data to be analyzed and processed quickly and easily. • Data mining yields five types of information: associations, sequences, classifications, clusters and forecasts.

Common data mining applications
APPLICATION: DESCRIPTION
• Market segmentation: Identifies the common characteristics of customers who buy the same products from the company
• Customer churn: Predicts which customers are likely to leave your company and go to a competitor
• Fraud detection: Identifies which transactions are most likely to be fraudulent
• Direct marketing: Identifies which prospects should be included in a mailing list to obtain the highest response rate
• Market basket analysis: Understands what products or services are commonly purchased together
• Trend analysis: Reveals the difference between a typical customer this month versus last month
• Science: Simulates nuclear explosions; visualizes quantum physics
• Entertainment: Models customer flows in theme parks; analyzes safety of amusement park rides
• Insurance and health care: Predicts which customers will buy new policies; identifies behavior patterns that increase insurance risk; spots fraudulent claims
• Manufacturing: Optimizes product design, balancing manufacturability and safety; improves shop-floor scheduling and machine utilization
• Medicine: Ranks successful therapies for different illnesses; predicts drug efficacy; discovers new drugs and treatments
• Oil and gas: Analyzes seismic data for signs of underground deposits; prioritizes drilling locations; simulates underground flows to improve recovery
• Retailing: Discerns buying-behavior patterns; predicts how customers will respond to marketing campaigns

3-Step Data Mining Process • Stage 1: Exploration. This stage usually starts with data preparation, which may involve cleaning data, transforming data, and selecting subsets of records. • Stage 2: Model building and validation. This stage involves considering various models and choosing the best one based on predictive performance. • Stage 3: Deployment. This final stage involves taking the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.
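The three stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the two candidate models (a constant mean and a least-squares line) and the error measure are all hypothetical choices, not part of the original slides.

```python
# Stage 1: exploration and preparation (hypothetical records: (x, y) pairs).
raw = [(1, 12.0), (2, 19.5), (3, None), (4, 41.0),
       (5, 52.5), (6, 58.0), (7, 71.5), (8, 79.0)]
clean = [(x, y) for x, y in raw if y is not None]   # cleaning: drop incomplete records
train, valid = clean[::2], clean[1::2]              # selecting subsets of records

# Stage 2: model building and validation with two assumed candidate models.
def fit_mean(rows):
    """Model that always predicts the mean of y seen in training."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_line(rows):
    """Least-squares straight line through the training rows."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x, _ in rows))
    return lambda x: my + slope * (x - mx)

def mean_abs_error(model, rows):
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)

candidates = {"mean": fit_mean(train), "line": fit_line(train)}
errors = {name: mean_abs_error(m, valid) for name, m in candidates.items()}
best_name = min(errors, key=errors.get)             # choose by predictive performance

# Stage 3: deployment. Apply the selected model to new data.
prediction = candidates[best_name](9)
print(best_name, round(prediction, 1))
```

Holding back part of the data for validation is what lets Stage 2 compare models on records they were not fitted to.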

Some of the tools used for data mining are: • Artificial neural networks - Non-linear predictive models that learn through training and resemble biological neural networks in structure. • Decision trees - Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. • Rule induction - The extraction of useful if-then rules from data based on statistical significance. • Genetic algorithms - Optimization techniques based on the concepts of genetic combination, mutation, and natural selection. • Nearest neighbor - A classification technique that classifies each record based on the records most similar to it in an historical database.

Reasons for the growing popularity of Data Mining • Growing Data Volume • Limitations of Human Analysis • Low Cost of Machine Learning

ADVANTAGES / APPLICATIONS OF DATA MINING • Marketing/Retailing: Data mining can aid direct marketers by providing them with useful and accurate trends about their customers’ purchasing behavior. • Banking/Crediting: Data mining can assist financial institutions in areas such as credit reporting and loan information.

ADVANTAGES OF DATA MINING Cont… • Law enforcement: Data mining can aid law enforcers in identifying and apprehending criminal suspects by examining trends in location, crime type, habits, and other patterns of behavior. • Researchers: Data mining can assist researchers by speeding up their data analysis, thus allowing them more time to work on other projects.

DISADVANTAGES OF DATA MINING • Although a data mining application is a very powerful tool, it cannot stand alone. To be successful, data mining needs a skilled user who supplies the correct data and a specialist who can draw objective conclusions from the output. If the user supplies incorrect or too little information, the output will be affected and the forecast will not be credible. • Furthermore, while data mining helps the user discover patterns and relationships in data, it cannot promise perfect results, cannot explain why an outcome occurs, and cannot correct problems in your data.

DISADVANTAGES OF DATA MINING • Privacy issues / Security issues: Although companies hold a lot of personal information about us online, many do not have sufficient security systems in place to protect that information. For example, American Express sold their customers’ credit card purchase data to another company. • Misuse of information: Some companies decide how quickly to answer your phone call based on your purchase history. If you have spent a lot of money or bought many products from a company, your call will be answered sooner. So you should not assume that your call is being answered in the order in which it was received.

Data Mining Techniques • Classification • Clustering • Regression • Association Rules

Difference between Classification and Clustering • Classification: supervised learning • Clustering: unsupervised learning

Supervised Classification • The input data, also called the training set, consists of multiple records each having multiple attributes or features. • Each record is tagged with a class label. • The objective of classification is to analyze the input data and to develop an accurate description or model for each class using the features present in the data. • This model is used to classify test data for which the class descriptions are not known. (1)

Supervised learning • Suppose you have a basket filled with fresh fruits, and your task is to arrange fruits of the same type in one place. • Suppose the fruits are apples, bananas, cherries and grapes. • From your previous work you already know the shape of each fruit, so it is easy to arrange fruits of the same type together. • Here, your previous work plays the role of the training data. • You have already learned from the training data because it contains a response variable: it tells you that a fruit with such-and-such features is a grape, and likewise for every other fruit. • This kind of information comes from the training data. • This type of learning is called supervised learning. • Problems of this type come under classification. • Because you have already learned from examples, you can do the job confidently.
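The fruit example can be sketched as a tiny supervised classifier. This is a minimal illustration, assuming each fruit is described by two hypothetical features (color and size) and that "learning" means remembering the most common label for each feature combination; the slide itself only mentions shape as the known feature.

```python
from collections import Counter, defaultdict

# Training data: each record has features plus a class label (the response variable).
training_set = [
    ({"color": "red", "size": "big"}, "apple"),
    ({"color": "red", "size": "small"}, "cherry"),
    ({"color": "green", "size": "big"}, "banana"),
    ({"color": "green", "size": "small"}, "grape"),
    ({"color": "red", "size": "big"}, "apple"),
]

def train(records):
    """Learn the most common label for each (color, size) combination."""
    votes = defaultdict(Counter)
    for features, label in records:
        votes[(features["color"], features["size"])][label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

def classify(model, features):
    """Predict the label of an unseen fruit, or 'unknown' if never seen in training."""
    return model.get((features["color"], features["size"]), "unknown")

model = train(training_set)
new_fruit = {"color": "green", "size": "small"}
label = classify(model, new_fruit)
print(label)
```

The labels in `training_set` are what make this supervised: without them, `train` would have nothing to learn from.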

Unsupervised learning: • Suppose you have a basket filled with fresh fruits, and your task is to arrange fruits of the same type in one place. • This time you know nothing about the fruits; you are seeing them for the first time. How will you arrange fruits of the same type together? • First, you pick up a fruit and select one of its physical characteristics, say color. • Then you arrange the fruits based on color, and the groups look something like this: • RED COLOR GROUP: apples and cherries. • GREEN COLOR GROUP: bananas and grapes.

Unsupervised learning: • Now you take another physical characteristic, size, and the groups look something like this: • RED COLOR AND BIG SIZE: apples. • RED COLOR AND SMALL SIZE: cherries. • GREEN COLOR AND BIG SIZE: bananas. • GREEN COLOR AND SMALL SIZE: grapes. • Job done. Here you did not learn anything beforehand: there is no training data and no response variable. • This type of learning is known as unsupervised learning. • Clustering comes under unsupervised learning.
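The grouping steps above can be mirrored in code. This sketch only reproduces the manual procedure from the slides (group by color, then by color and size); it is not a real clustering algorithm, and the `id`, `color` and `size` fields are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Unlabeled fruits: only observable characteristics, no fruit names.
fruits = [
    {"id": 1, "color": "red", "size": "big"},
    {"id": 2, "color": "red", "size": "small"},
    {"id": 3, "color": "green", "size": "big"},
    {"id": 4, "color": "green", "size": "small"},
    {"id": 5, "color": "red", "size": "small"},
]

def group_by(items, *attributes):
    """Put items that share the same values of the given attributes in one group."""
    groups = defaultdict(list)
    for item in items:
        key = tuple(item[a] for a in attributes)
        groups[key].append(item["id"])
    return dict(groups)

by_color = group_by(fruits, "color")                  # first pass: color only
by_color_and_size = group_by(fruits, "color", "size") # second pass: color and size

print(by_color)
print(by_color_and_size)
```

No label appears anywhere in the input; the groups emerge purely from the shared characteristics.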

Classification • Classification: Given a set of items that fall into several classes, and given past instances (training instances) with their associated classes, classification is the process of predicting the class of a new item. • The goal is therefore to identify the class to which the new item belongs. • Example: A bank wants to classify its home loan customers into groups according to their response to bank advertisements. The bank might use the classes “Responds Rarely”, “Responds Sometimes” and “Responds Frequently”. • The bank will then attempt to find rules that describe the customers who respond frequently and sometimes. • These rules could be used to predict the needs of potential customers.

Technique for Classification • Decision-Tree Classifiers • [Figure: decision tree predicting the credit risk (Good or Bad) of a person with the jobs specified. The root node splits on Job (Engineer, Carpenter, Doctor); each branch then splits on Income thresholds.]

Classification Process (1): Model Construction • Training Data → Classification Algorithms → Classifier (Model) • Example of a learned rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification Process (2): Use the Model in Prediction • The Classifier is applied to Testing Data and then to Unseen Data • Example of unseen data: (Jeff, Professor, 4) → Tenured?
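The rule from the model-construction slide can be applied to the unseen record as a short sketch. The function name `predict_tenured` is hypothetical, and the case-insensitive comparison of the rank is an assumption made because the rule writes ‘professor’ while the record writes Professor.

```python
def predict_tenured(rank, years):
    """Apply the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    return "no"

# Unseen record from the slide: (name, rank, years).
name, rank, years = ("Jeff", "Professor", 4)
result = predict_tenured(rank, years)
print(name, "tenured?", result)
```

Jeff satisfies the first condition of the rule (his rank is professor), so the model answers yes even though his 4 years do not exceed 6.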

Clustering • “Clustering algorithms find groups of items that are similar. … It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other.” • Example: An insurance company could use clustering to group clients by their age, location and types of insurance purchased. • The categories are unspecified, and this is referred to as ‘unsupervised learning’.

Clustering • Group Data into Clusters – Similar data is grouped in the same cluster – Dissimilar data is grouped in different clusters • How is this achieved? – K-Nearest Neighbor • A classification method that classifies a point by calculating the distances between the point and the points in the training data set. It then assigns the point to the class that is most common among its k nearest neighbors (where k is an integer). Because it relies on labeled training data, it is a supervised method rather than a clustering algorithm in the strict sense. – Hierarchical • Groups data into trees of nested clusters
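The k-nearest-neighbor description above can be sketched as follows. The training records (age, number of policies) and the labels "basic" and "premium" are hypothetical, and Euclidean distance is an assumed choice of distance measure.

```python
import math
from collections import Counter

def knn_classify(training, point, k=3):
    """Assign `point` the most common class among its k nearest training points."""
    by_distance = sorted(training, key=lambda rec: math.dist(rec[0], point))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labeled training data: ((age, number of policies), class).
training = [
    ((25, 1), "basic"), ((30, 1), "basic"), ((28, 2), "basic"),
    ((55, 4), "premium"), ((60, 5), "premium"), ((58, 3), "premium"),
]

label = knn_classify(training, (57, 4), k=3)
print(label)
```

An odd k is a common choice for two classes because it avoids tied votes.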

What Is Good Clustering? • A good clustering method will produce high quality clusters in which: • the intra-class (that is, intra-cluster) similarity is high. • the inter-class similarity is low. • The quality of a clustering result also depends on both the similarity measure used by the method and its implementation. • The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns. • However, objective evaluation is problematic: usually done by human / expert inspection
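One way to make the two quality criteria concrete is to compare distances within clusters against distances between clusters. This is a sketch under assumptions: the client ages are hypothetical, and absolute difference stands in for the similarity measure (a small distance means high similarity).

```python
from itertools import combinations, product

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def intra_cluster_distance(cluster):
    """Average distance between pairs of points inside one cluster."""
    return mean(abs(a - b) for a, b in combinations(cluster, 2))

def inter_cluster_distance(c1, c2):
    """Average distance between points taken from two different clusters."""
    return mean(abs(a - b) for a, b in product(c1, c2))

# Hypothetical clusters of client ages.
young = [22, 25, 27]
older = [58, 61, 65]

intra = mean([intra_cluster_distance(young), intra_cluster_distance(older)])
inter = inter_cluster_distance(young, older)
print(intra, inter)
```

A good clustering shows a small `intra` value (high intra-cluster similarity) and a large `inter` value (low inter-cluster similarity).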

Difference between Classification and Clustering • CLASSIFICATION • We have a training set containing data that have been previously categorized • Based on this training set, the algorithm finds the category that new data points belong to • Since a training set exists, we describe this technique as supervised learning • CLUSTERING • We do not know in advance which characteristics make data similar • Using statistical concepts, we split the dataset into sub-datasets such that each sub-dataset contains “similar” data • Since a training set is not used, we describe this technique as unsupervised learning

Regression • Regression is a data mining (machine learning) technique used to fit an equation to a dataset. The simplest form of regression, linear regression, uses the formula of a straight line (y = mx + b) and determines the appropriate values for m and b to predict the value of y based upon a given value of x. • Advanced techniques, such as multiple regression, allow the use of more than one input variable and allow for the fitting of more complex models, such as a quadratic equation.
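A minimal sketch of fitting y = mx + b, assuming the "appropriate values" of m and b are chosen by ordinary least squares (the slide does not name the fitting criterion). The data points are hypothetical and lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

m, b = fit_line(xs, ys)
predicted = m * 6 + b   # predict y for a new value x = 6
print(m, b, predicted)
```

With real data the points will not fall exactly on a line, and m and b are then the values that minimize the sum of squared prediction errors.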

Regression • “Regression deals with the prediction of a value, rather than a class.” • Example: Find out whether there is a relationship between smoking and cancer-related illness in patients. • Given values: $X_1, X_2, \dots, X_n$ • Objective: predict the variable $Y$ • One way is to estimate coefficients $a_0, a_1, a_2, \dots, a_n$ such that – $Y = a_0 + a_1 X_1 + a_2 X_2 + \dots + a_n X_n$ – This is linear regression

Association Rules • “An association algorithm creates rules that describe how often events have occurred together. ” • Example: When a customer buys a hammer, then 90% of the time they will buy nails.

Association Rules • Support: “is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule” • Example: – People who buy hotdog buns also buy hotdog sausages in 99% of cases. Strictly, this conditional percentage is the rule's confidence; the rule has high support only if buns and sausages together also appear in a large fraction of all purchases. – People who buy hotdog buns buy hangers in 0.005% of cases, so buns and hangers together appear in at most 0.005% of all purchases = low support • Situations where there is high support for the antecedent are worth careful attention – E.g. hotdog sausages should be placed near hotdog buns in supermarkets if there is also high confidence.
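Support and confidence for a rule such as "buns → sausages" can be computed directly from a list of transactions. The transactions below are hypothetical, and confidence is taken here in its usual sense: the fraction of transactions containing the antecedent that also contain the consequent.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"buns", "sausages", "ketchup"},
    {"buns", "sausages"},
    {"buns", "sausages", "hangers"},
    {"milk"},
    {"buns", "sausages", "milk"},
]

def support(transactions, antecedent, consequent):
    """Fraction of all transactions that contain both antecedent and consequent."""
    both = antecedent | consequent
    return sum(1 for t in transactions if both <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    return sum(1 for t in with_antecedent if consequent <= t) / len(with_antecedent)

s_sausages = support(transactions, {"buns"}, {"sausages"})
c_sausages = confidence(transactions, {"buns"}, {"sausages"})
s_hangers = support(transactions, {"buns"}, {"hangers"})
c_hangers = confidence(transactions, {"buns"}, {"hangers"})
print(s_sausages, c_sausages, s_hangers, c_hangers)
```

In this toy data the rule buns → sausages has both high support and high confidence, while buns → hangers is low on both measures.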

Data Mining Applications Here is the list of areas where data mining is widely used: • Financial Data Analysis • Retail Industry • Telecommunication Industry • Biological Data Analysis • Other Scientific Applications • Intrusion Detection

FINANCIAL DATA ANALYSIS • Financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Here are a few typical cases: • Design and construction of data warehouses for multidimensional data analysis and data mining. • Loan payment prediction and customer credit policy analysis. • Classification and clustering of customers for targeted marketing. • Detection of money laundering and other financial crimes.

RETAIL INDUSTRY • Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption and services. The quantity of data collected will naturally continue to expand rapidly because of the increasing ease, availability and popularity of the web. • Data mining in the retail industry helps identify customer buying patterns and trends. That leads to improved quality of customer service and to good customer retention and satisfaction. Here is a list of examples of data mining in the retail industry: • Design and construction of data warehouses based on the benefits of data mining. • Multidimensional analysis of sales, customers, products, time and region. • Analysis of the effectiveness of sales campaigns. • Customer retention. • Product recommendation and cross-referencing of items.

TELECOMMUNICATION INDUSTRY • Today the telecommunication industry is one of the fastest-emerging industries, providing various services such as fax, pager, cellular phone, Internet messenger, images, e-mail and web data transmission. Due to the development of new computer and communication technologies, the telecommunication industry is expanding rapidly. This is why data mining has become very important in helping to understand the business. • Multidimensional analysis of telecommunication data. • Fraudulent pattern analysis. • Identification of unusual patterns. • Multidimensional association and sequential pattern analysis. • Mobile telecommunication services. • Use of visualization tools in telecommunication data analysis.

BIOLOGICAL DATA ANALYSIS • Nowadays there is vast growth in fields of biology such as genomics, proteomics, functional genomics and biomedical research. Biological data mining is a very important part of bioinformatics. The following are the ways in which data mining contributes to biological data analysis: • Semantic integration of heterogeneous, distributed genomic and proteomic databases. • Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences. • Discovery of structural patterns and analysis of genetic networks and protein pathways. • Association and path analysis. • Visualization tools in genetic data analysis.

Queries?