CSE 591 Data Mining Data Preparation Web Mining
CSE 591 Data Mining, Data Preparation & Web Mining
New Room: LL 271
Huan Liu, CSE, CEAS, ASU
http://www.public.asu.edu/~hliu/cse591.html
1/23/2022 CSE 591: Data Mining by H. Liu
CSE 591
- Contents: Classification, Clustering, Association, Data Warehousing, Web, and Applications
- Format: a seminar course with paper reading, discussion, project, and presentation
- Assessment: class participation, project proposal, presentation, exams
Course Format
- Research papers are the main source; they can be found on the course web site
- You can choose one of the textbooks listed. The reference list is an entry point for you to access related subjects
- Everyone is expected to read the papers and participate in class discussion
- Presenters will be evaluated on the spot
Paper presentation
- Each student will be responsible for one topic. All are expected to read the material(s) before the presentation.
  - What is it about?
  - What are the points to discuss and improve?
  - What can we do with it?
- Each presentation is about 35 minutes, including discussion and question & answer
Project
- Proposal
  - Proposal presentation, discussion, revision
  - A project should be completed in a semester
- Project
  - Presentation and demo
- Report
Topic Distribution (tentative)
Your first assignment
- Think about what you want to accomplish.
- Pick an area of interest and choose a general topic for presentation.
- Registered students: send me an email with CSE 591 in the subject and your areas of interest (use your frequently used email account so you won’t miss important announcements).
- Complete the above before the 2nd class.
Introduction
- The need for data mining
- Data warehousing
- Web mining
- Applications
What is data mining
- Data mining is
  - extraction of useful patterns from data sources, e.g., databases, texts, web, images.
  - the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
Patterns (1)
- Patterns are the relationships and summaries derived through a data mining exercise.
- Patterns must be:
  - valid
  - novel
  - potentially useful
  - understandable
Patterns (2)
- Patterns are used for
  - prediction or classification
  - describing the existing data
  - segmenting the data (e.g., the market)
  - profiling the data (e.g., your customers)
  - etc.
Data (1)
- Data mining typically deals with data that have already been collected for some purpose other than data mining.
- Data miners usually have no influence on data collection strategies.
- Large bodies of data cause new problems: representation, storage, retrieval, analysis, ...
Data (2)
- Even with a very large data set, we are usually faced with just a sample from the population.
- Data exist in many types (continuous, nominal) and forms (credit card usage records, supermarket transactions, government statistics, text, images, medical records, molecular databases).
Some DM tasks
- Classification: mining patterns that can classify future data into known classes.
- Association rule mining: mining any rule of the form X → Y, where X and Y are sets of data items.
- Clustering: identifying a set of similarity groups in the data.
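A minimal sketch of how a rule X → Y over sets of items might be scored. The slide does not say how rules are evaluated; support and confidence are assumed here as the usual measures, and the function names and the toy transactions are hypothetical.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Estimate of P(Y in transaction | X in transaction) for the rule X -> Y."""
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0  # X never occurs, so the rule has no evidence
    return support(frozenset(x) | frozenset(y), transactions) / support_x


# Toy supermarket transactions (made up for illustration).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"bread -> milk: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A full association rule miner would enumerate candidate itemsets and keep only rules above chosen support and confidence thresholds; this sketch only scores one given rule.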
- Sequential pattern mining: a sequential rule A → B says that event A will be immediately followed by event B with a certain confidence.
- Deviation detection: discovering the most significant changes in data.
- Data visualization: using graphical methods to show patterns in data.
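A rough sketch of the confidence of a sequential rule A → B. The slide gives no formula, so the definition below is an assumption: the number of times A is immediately followed by B, divided by the number of times A occurs. The function name and the event sequences are hypothetical.

```python
from typing import List, Sequence


def sequential_confidence(a: str, b: str, sequences: List[Sequence[str]]) -> float:
    """Share of occurrences of event `a` that are immediately followed by `b`."""
    a_count = 0
    ab_count = 0
    for seq in sequences:
        for i, event in enumerate(seq):
            if event == a:
                a_count += 1
                # Count only an immediate successor, per the slide's wording.
                if i + 1 < len(seq) and seq[i + 1] == b:
                    ab_count += 1
    return ab_count / a_count if a_count else 0.0


# Toy event sequences, one per user session (made up for illustration).
sequences = [
    ["login", "search", "buy"],
    ["login", "search", "logout"],
    ["search", "buy"],
    ["login", "logout"],
]

print(f"search -> buy: {sequential_confidence('search', 'buy', sequences):.2f}")
```

Other definitions are possible, for example counting each sequence at most once or allowing gaps between A and B; the choice depends on the application.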
Why data mining
- Rapid computerization of businesses produces huge amounts of data
- How do we make the best use of the data?
- A growing realization: knowledge discovered from data can be used for competitive advantage.
- Make use of your data assets
- Many interesting things you want to find cannot be found using database queries
  - “Find me people likely to buy my products”
  - “Who is likely to respond to my promotion?”
- Quickly identify underlying relationships and respond to emerging opportunities
Why now
- The data is abundant.
- The data is being warehoused.
- The computing power is affordable.
- The competitive pressure is strong.
- Data mining tools have become available.
DM fields
- Data mining is an emerging multidisciplinary field:
  - Statistics
  - Machine learning
  - Databases
  - Visualization
  - OLAP and data warehousing
  - ...
Summary
- What is data mining?
  - KDD (knowledge discovery in databases): nontrivial extraction of implicit, previously unknown, and potentially useful information
- Why do we need data mining?
  - Wide use of computer systems - data explosion - knowledge is power - but we’re data rich, knowledge lean - actionability ...
Data Warehousing
- What is a data warehouse?
  - A repository of integrated, analysis-oriented, historical, read-only data, designed for decision support and KDD systems
- Why do we need data warehousing?
  - Operational systems were never designed for KDD; they are numerous, of different types, and have overlapping or contradictory definitions
An Overview of KDD Process (Guess which is which)
Web mining
- The Web is a massive database
- Semi-structured data
- XML and RDF
- Web mining
  - Content
  - Structure
  - Usage