CSE 591 Data Mining Huan Liu CSE CEAS

CSE 591 Data Mining
Huan Liu, CSE, CEAS, ASU
http://www.public.asu.edu/~huanliu/DM04S/cse591.html
12/15/2021 CSE 591: Data Mining by H. Liu 1

CSE 591
  • Contents: classification, clustering, association, and applications
  • Format: a seminar course with a lot of assignments and work
      • Paper reading, discussion, project, presentation
  • Assessment: class participation, assignments, project proposal, presentations, exam(s)

You
  • TA: Jigar Mody, jigar.mody@asu.edu
  • Me: Huan Liu, huanliu@asu.edu
      • Where: Brickyard 566
      • When: right after class, other times by appointment
  • MyASU will be used, so make sure your email address is correct and you won’t miss important announcements

Course Format
  • An experiment since Fall 2000 about effective teaching of graduate data mining
  • Research papers: the main categories can be found on the course web site
  • You can choose one of the textbooks listed. The reference list is an entry point to related subjects
  • Everyone is expected to read the papers and participate in class discussion
  • Presenters will be evaluated on the spot

  • Projects (25%, 10%)
  • Exam(s) (40%)
  • Assignments, quizzes, and class participation (25%)
  • Late penalty: yes
  • Academic integrity (http://www.public.asu.edu/~huanliu/conduct.html)

Paper presentation
  • Each student will be responsible for one topic. All are expected to search for and read the selected material(s) before the presentation.
      • What is it about?
      • What are the points to discuss and improve?
      • What can we do with it?
  • Each presentation is about 30 minutes, including discussion and question & answer

Project
  • Proposal
      • Proposal presentation, discussion, revision
      • A project should be completed in a semester
  • Project
      • Presentation and demo
  • Report

Topic Distribution (tentative)

Categories of interest (including design and implementation)
  1. Data and application security
      • Data mining and privacy
  2. Data reduction and selection
      • Streaming data reduction
      • Dealing with large data (column- & row-wise)
      • Selection bias
  3. Learning algorithms
      • Ensemble methods
      • Incremental learning
      • Active learning and co-training
  4. Bioinformatics for CBS 591

Your first assignment
  • Think about what you want to accomplish.
  • List two of your areas of interest (don’t be restricted by the previous list).
  • Pick an area of interest and choose a general topic for paper presentation.
  • Complete the above and submit it in the 2nd class.

2nd Assignment due in two weeks (2/5/04) – due date revised
  • Choose your category of interest
  • Find at least 2 quality papers in that category
      • TA will help you and compile a list of all papers at the end
  • Write a (< 1 page) summary for each paper
      • What is it about
      • Why is it significant and relevant
      • Where is it published and when

Introduction
  • The need for data mining
  • Data mining
  • Web mining
  • Applications

What is data mining
  • Data mining is
      • extraction of useful patterns from data sources, e.g., databases, texts, web, images.
      • the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

Patterns (1)
  • Patterns are the relationships and summaries derived through a data mining exercise.
  • Patterns must be:
      • valid
      • novel
      • potentially useful
      • understandable

Patterns (2)
  • Patterns are used for
      • prediction or classification
      • describing the existing data
      • segmenting the data (e.g., the market)
      • profiling the data (e.g., your customers)
      • etc.

Data (1)
  • Data mining typically deals with data that have already been collected for some purpose other than data mining.
  • Data miners usually have no influence on data collection strategies.
  • Large bodies of data cause new problems: representation, storage, retrieval, analysis, ...

Data (2)
  • Even with a very large data set, we are usually faced with just a sample from the population.
  • Data exist in many types (continuous, nominal) and forms (credit card usage records, supermarket transactions, government statistics, text, images, medical records, human genome databases, molecular databases).

Some DM tasks
  • Classification: mining patterns that can classify future data into known classes.
  • Association rule mining: any rule of the form X → Y, where X and Y are sets of data items.
  • Clustering: identifying a set of similarity groups in the data.
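
To make the association rule task concrete, here is a minimal Python sketch that scores one rule X → Y over a toy set of transactions. The slides do not define how a rule is scored; support and confidence are the usual measures and are assumed here. The function names and the grocery data are hypothetical, chosen only for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Estimate of P(Y | X): support(X and Y together) / support(X)."""
    x, y = frozenset(x), frozenset(y)
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0  # X never occurs, so the rule has no evidence
    return support(x | y, transactions) / support_x


# Hypothetical market-basket data (not from the course material).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

# Rule {bread} -> {milk}
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A real miner would search over many candidate itemsets rather than score a single hand-picked rule; this sketch only shows what is being measured.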

  • Sequential pattern mining: a sequential rule A → B says that event A will be immediately followed by event B with a certain confidence.
  • Deviation detection: discovering the most significant changes in data.
  • Data visualization: using graphical methods to show patterns in data.
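
The confidence of a sequential rule A → B can be estimated as the fraction of occurrences of A that are immediately followed by B. The Python sketch below assumes that reading of "a certain confidence"; the function name and the event logs are hypothetical.

```python
def sequential_confidence(sequences, a, b):
    """Fraction of occurrences of event `a` that are immediately followed by `b`."""
    a_count = 0
    ab_count = 0
    for seq in sequences:
        for i, event in enumerate(seq):
            if event == a:
                a_count += 1
                # Only the very next event counts as "immediately followed".
                if i + 1 < len(seq) and seq[i + 1] == b:
                    ab_count += 1
    return ab_count / a_count if a_count else 0.0


# Hypothetical event sequences, one list per user session.
sequences = [
    ["login", "search", "buy"],
    ["login", "search", "logout"],
    ["search", "buy"],
    ["login", "logout"],
]

conf = sequential_confidence(sequences, "search", "buy")
print(f"confidence(search -> buy) = {conf:.2f}")
```

Counting an A at the end of a sequence as an occurrence with no successor is a design choice; excluding such cases would raise the estimate.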

Why data mining
  • Rapid computerization of businesses produces huge amounts of data
  • How to make best use of data?
  • A growing realization: knowledge discovered from data can be used for competitive advantage.

Make use of your data assets
  • Many interesting things you want to find cannot be found using database queries
      • “Find me people likely to buy my products”
      • “Who is likely to respond to my promotion?”
  • Quickly identify underlying relationships and respond to emerging opportunities

Why now
  • The data is abundant.
  • The data is being warehoused.
  • The computing power is affordable.
  • The competitive pressure is strong.
  • Data mining tools have become available.

DM fields
  • Data mining is an emerging multidisciplinary field:
      • Statistics
      • Machine learning
      • Databases
      • Visualization
      • OLAP and data warehousing
      • ...

Summary
  • What is data mining?
      • KDD - knowledge discovery in databases: non-trivial extraction of implicit, previously unknown and potentially useful information
  • Why do we need data mining?
      • Wide use of computer systems - data explosion - knowledge is power - but we’re data rich, knowledge lean - actionability ...

An Overview of KDD Process (Guess which is which)

Web mining – an application
  • The Web is a massive database
  • Semi-structured data
  • XML and RDF
  • Web mining
      • Content
      • Structure
      • Usage
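
As a small illustration of the "Structure" branch of Web mining, the Python sketch below counts how many distinct pages link to each page in a tiny link graph. The slides name the three branches without giving a method, so in-link counting is an assumption picked as one of the simplest structure measures; the page names and the `inlink_counts` helper are hypothetical.

```python
from collections import Counter


def inlink_counts(links):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):  # ignore repeated links from one page
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical site: page -> pages it links to.
links = {
    "home": ["about", "courses"],
    "about": ["home"],
    "courses": ["home", "about", "about"],
}

counts = inlink_counts(links)
for page, n in sorted(counts.items()):
    print(page, n)
```

Content mining would instead look at the text of each page, and usage mining at visitor logs; neither is covered by this sketch.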