Programming for Engineers in Python
Lecture 3: Data Analysis
Autumn 2011-12

Lecture 2: Highlights
• Simulation: power lines and rare diseases
• Plan before coding
• Using Modules
• Import
• Constants

Tuples
• Fixed size
• Immutable (like strings)
• What are they good for (compared to lists)?
  • Simpler (“lightweight”)
  • Stuff multiple things into a single container
  • Immutable (e.g., records in a database)
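
The points above can be illustrated with a minimal sketch in Python 3 syntax. The record contents (`student`, `point`) are made up for illustration and do not come from the lecture.

```python
# A tuple groups several values into one fixed-size, immutable record.
point = (3, 4)                    # hypothetical 2-D point
student = ("Dana", 21, "EE")      # hypothetical record: name, age, department

# Unpack the record into separate variables.
name, age, department = student

# Tuples do not support item assignment, so this raises TypeError.
try:
    student[1] = 22
    modified = True
except TypeError:
    modified = False

# Because a tuple is immutable (and hashable), it can be a dictionary key.
# A list cannot be used this way.
distances = {point: 5.0}

print(name, age, department, modified, distances[(3, 4)])
```

The failed assignment is the practical meaning of "immutable": once a record is built, code that receives it cannot change it by accident.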

Dictionaries (Hash Tables)
• Key – Value mapping
• Fast!
• Usage:
  • Database
  • Dictionary
  • Phone book
[Figure: keys mapped to values]
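
As a rough illustration of the phone-book usage, the sketch below (Python 3 syntax) builds a small dictionary and exercises the common operations. The names and numbers are invented for the example.

```python
# Keys are names, values are phone numbers (all entries are made up).
phone_book = {"Alice": "03-5551234", "Bob": "04-5559876"}

phone_book["Carol"] = "02-5550000"            # add a new key-value pair
alice_number = phone_book["Alice"]            # look up a value by its key
has_dave = "Dave" in phone_book               # membership test on keys
dave_number = phone_book.get("Dave", "unknown")  # lookup with a default
del phone_book["Bob"]                         # remove an entry

# Iterate over the keys in alphabetical order.
for name in sorted(phone_book):
    print(name, phone_book[name])
```

Indexing with a missing key (`phone_book["Dave"]`) raises `KeyError`, which is why `get` with a default is often the safer lookup.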

Dictionaries (Cont.)

Dict – Initialize a Dictionary from a List
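
The slide's own code did not survive in this transcript. One common way to do this, shown here as a sketch with made-up data, is to pass a list of (key, value) pairs to `dict()`, or to pair up two parallel lists with `zip`.

```python
# A list of (key, value) tuples can be turned directly into a dictionary.
pairs = [("one", 1), ("two", 2), ("three", 3)]
numbers = dict(pairs)

# Two parallel lists can be paired up with zip() first.
names = ["a", "b", "c"]
values = [10, 20, 30]
zipped = dict(zip(names, values))

print(numbers["two"], zipped["c"])
```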

Sorting Lists
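
The slide's code is missing from this transcript, so the following is only a sketch of the standard tools: `sorted()` returns a new list, while `list.sort()` reorders the list in place. The sample data is invented.

```python
grades = [88, 72, 95, 60]

ascending = sorted(grades)      # new sorted list; grades is unchanged here
grades.sort(reverse=True)       # in-place sort, largest first

# The key argument chooses what to compare; here, the length of each word.
words = ["pear", "fig", "banana"]
by_length = sorted(words, key=len)

print(ascending, grades, by_length)
```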

Types and Casting
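
Since the slide's examples are not in this transcript, here is a small sketch of the built-in conversions `int`, `float`, and `str`, with values chosen only for illustration.

```python
text = "42"
number = int(text)                  # string -> int
ratio = float("3.5")                # string -> float
truncated = int(3.9)                # float -> int drops the fraction, no rounding
label = str(number) + " items"      # int -> string, needed before concatenation
digits = [int(ch) for ch in "2011"] # convert each character separately
kind = type(number).__name__        # name of the value's type

# A string that is not a number cannot be cast; int() raises ValueError.
try:
    int("forty-two")
    parsed = True
except ValueError:
    parsed = False

print(number, ratio, truncated, label, digits, kind, parsed)
```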

Today: Data Analysis
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.
• Descriptive
• Predictive

Data Analysis Examples
• Stock market trends
• Genome-disease association
• Face recognition
• Production yield
• Business intelligence
• Speech recognition
• Text categorization

Text Categorization / Document Classification

How is it Done?
• Manually
• Automatically
  • Gather document statistics
  • Measure how similar it is to documents in each category
• Today we will collect word statistics from several well-known books
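
The slides do not say which similarity measure is meant. As one possible illustration, the sketch below compares word-count dictionaries with cosine similarity, a measure chosen here as an assumption. The three short texts and the function names are hypothetical.

```python
import math

def word_counts(text):
    """Map each lower-cased word in text to its number of occurrences."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(counts_a[w] * counts_b.get(w, 0) for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up "category" texts and a document to classify.
sports = word_counts("the team won the game and the fans cheered")
finance = word_counts("the bank raised the interest rate and the market fell")
document = word_counts("the fans watched the team play the game")

to_sports = cosine_similarity(document, sports)
to_finance = cosine_similarity(document, finance)
closer_to_sports = to_sports > to_finance
print(to_sports, to_finance, closer_to_sports)
```

Note that the very common word "the" contributes heavily to both scores, which hints at why real systems clean up the statistics before comparing documents.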

Plan
• Find data
• Collect word statistics
• Observe results

Find Data
• This might be the hardest task for many applications!
• Project Gutenberg (http://www.gutenberg.org/)
  • Alice's Adventures in Wonderland (http://www.gutenberg.org/cache/epub/11/pg11.txt)
  • The Bible, King James version, Book 1: Genesis (http://www.gutenberg.org/cache/epub/8001/pg8001.txt)

Reading a Book
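
The slide's code is not in this transcript. A minimal sketch of reading a text file into a list of words might look as follows (Python 3). The function name `read_words` is hypothetical, and a small temporary file stands in for a downloaded book so the example runs without network access.

```python
import os
import tempfile

def read_words(path):
    """Return the whitespace-separated words of a text file as a list."""
    with open(path, encoding="utf-8") as book:
        return book.read().split()

# Stand-in for a downloaded book: write two lines to a temporary file.
sample_text = (
    "Alice was beginning to get very tired\n"
    "of sitting by her sister on the bank\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample_text)
    sample_path = tmp.name

words = read_words(sample_path)
os.remove(sample_path)          # clean up the temporary file

print(len(words), words[:3])
```

With a real book, `sample_path` would simply be the path of the saved `.txt` file.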

Flying

Print Most Popular Words (High Level)

Modular Programming
• Top-down approach: first write what you plan to do and then implement the details
• Clear for readers
• Easier to debug
• Easier to update
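
A rough sketch of the top-down style: the high-level function is written first and reads like the plan, and the helpers it names are filled in afterwards. All function names here are hypothetical, and the helpers use `collections.Counter` only to keep the sketch short. That shortcut is an assumption, not the lecture's implementation.

```python
from collections import Counter

def print_most_popular(text, how_many):
    """High-level plan: count the words, rank them, print the top ones."""
    counts = count_words(text)
    ranked = rank_by_count(counts)
    top = ranked[:how_many]
    for word, count in top:
        print(word, count)
    return top

# The details are implemented only after the plan above is written down.
def count_words(text):
    """Count lower-cased, whitespace-separated words."""
    return Counter(text.lower().split())

def rank_by_count(counts):
    """Return (word, count) pairs, most frequent first."""
    return counts.most_common()

top = print_most_popular("the cat and the hat and the bat", 2)
```

Each helper can be tested or replaced on its own, which is the "easier to debug, easier to update" benefit listed above.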

PrintMostPopular: Build Word-Occurrences Dictionary
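
The original code for this step is missing from the transcript. One plausible version, sketched with a hypothetical function name and sample sentence, walks over the words and keeps a running count in a dictionary.

```python
def build_word_occurrences(words):
    """Return a dictionary mapping each word to how many times it appears."""
    occurrences = {}
    for word in words:
        if word in occurrences:
            occurrences[word] += 1   # seen before: increase its count
        else:
            occurrences[word] = 1    # first time this word appears
    return occurrences

sample_words = "to be or not to be".split()
occurrences = build_word_occurrences(sample_words)
print(occurrences)
```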

PrintMostPopular: Sort Words by Occurrences
Reference: http://docs.python.org/library/operator.html
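
The slide points to the `operator` module. A likely use, shown here as a sketch with invented counts, is `operator.itemgetter(1)` as the sort key, so that (word, count) pairs are ordered by their count.

```python
from operator import itemgetter

# Made-up word counts standing in for a real word-occurrences dictionary.
occurrences = {"alice": 5, "the": 12, "rabbit": 3, "said": 7}

# items() yields (word, count) pairs; itemgetter(1) picks the count.
ranked = sorted(occurrences.items(), key=itemgetter(1), reverse=True)

for word, count in ranked[:3]:
    print(word, count)
```

`itemgetter(1)` behaves like `lambda pair: pair[1]` and is the piece of the `operator` module that this step needs.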

The Code

Results

And Now for Several Books

Results
The word “to” as an example
[Figure: panels labeled Bible and L. Carroll]

How is it Really Done?
• Preprocessing (e.g., words to lower case, remove punctuation marks)
• Word count
• Enhance statistics
  • Discard stop words (e.g., and, of, a)
  • Stemming (e.g., go & went)
  • Synonyms
  • Bigrams, trigrams
• Similarity measures to existing documents / categories
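
Some of the steps above can be sketched in a few lines of Python 3. The stop-word set below is a tiny illustrative list (the slide names "and", "of", "a"; "the" is added here as an assumption), the function names are hypothetical, and stemming and synonym handling are left out because they need language resources beyond this sketch.

```python
import string

# Tiny illustrative stop-word list; a real list is much longer.
STOP_WORDS = {"and", "of", "a", "the"}

def preprocess(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    remove_punctuation = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(remove_punctuation).split()
    return [word for word in words if word not in STOP_WORDS]

def bigrams(words):
    """Return the list of adjacent word pairs."""
    return list(zip(words, words[1:]))

tokens = preprocess("The Queen of Hearts, she made some tarts!")
pairs = bigrams(tokens)
print(tokens)
print(pairs)
```

Counting the bigrams instead of single words uses the same dictionary technique, with each pair (a tuple) as the key.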

How is it Really Done? Categories
Topics: http://www.cs.tau.ac.il/courses/pyProg/1112a/lectures/3/topics.rbb
Categories Hierarchy: http://www.cs.tau.ac.il/courses/pyProg/1112a/lectures/3/rcv1.topics.hier.orig

How is it Really Done? Enhance Statistics
Stop words: http://www.cs.tau.ac.il/courses/pyProg/1112a/lectures/3/english.stop
After processing: http://www.cs.tau.ac.il/courses/pyProg/1112a/lectures/3/lyrl2004-non-v2_tokens_test_pt0.dat

Off Topic: Find the Error

And Now?

INDENTATION IS REALLY, REALLY IMPORTANT!
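
The code from the "Find the Error" slides did not survive in this transcript. As a hypothetical example of the same lesson, the two functions below differ only in the indentation of one `return` line, and they compute different results.

```python
def sum_with_inner_return(numbers):
    total = 0
    for n in numbers:
        total += n
        return total    # inside the loop: returns after the first item

def sum_with_outer_return(numbers):
    total = 0
    for n in numbers:
        total += n
    return total        # after the loop: returns the full sum

values = [1, 2, 3]
inner_result = sum_with_inner_return(values)
outer_result = sum_with_outer_return(values)
print(inner_result, outer_result)
```

Neither version raises an error, so the mistake in the first one shows up only as a wrong answer.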