Welcome to Intro to Bioinformatics
Bioinformatics in Space


Bioinformatics in Space: Intergalactic Border Patrol
• Tribbles: Cute and harmless.
• Trogs: Warning! Highly dangerous!

Welcome to the Intergalactic Detention Center
Please answer the following questions on a scale of 1 to 10:
1. Like broccoli
2. Floss every brushing
3. Enjoy ballet
4. Always pair socks
5. Liked Moby Dick
6. Eat the Maraschino cherry

Responses to questionnaire

                          T1    T2    T3    T4    T5    T6    T7   ...
1. Broccoli              9.2   1.6   4.0   5.2   2.2   9.1   1.0   ...
2. Floss                 2.2   1.9   1.0   4.6   7.6   9.8   1.0   ...
3. Ballet                8.3   3.1   2.4   6.1   9.3   9.2   1.0   ...
4. Pair socks            9.6   5.5   1.3   8.4   9.8   9.0   1.0   ...
5. Moby Dick             4.2   2.1   1.0   4.1   5.2   4.4   1.0   ...
6. Maraschino            6.4   8.9   7.1   3.3   1.9   2.0   1.0   ...
...
6817. MacArthur’s Park   1.2   1.5   5.1   3.4   1.1   1.7   9.9   ...

You need a plan

A Plan
• Release all Tribbles / Trogs
• Note outcome for each individual
• Deduce identities
• Integrate identities into results
• Figure out which questions/answers are informative

Responses to questionnaire
[Same response table, now annotated with the two classes]
Tribbles   (what now?)   Trogs

Responses to questionnaire

                                                                     Mean
                          T1    T2    T3    T4    T5    T6    T7   Tribbles  Trogs
1. Broccoli              9.2   1.6   4.0   5.2   2.2   9.1   1.0     6.4      2.2
2. Floss                 2.2   1.9   1.0   4.6   7.6   9.8   1.0     6.0      1.3
3. Ballet                8.3   3.1   2.4   6.1   9.3   9.2   1.0     8.2      2.2
4. Pair socks            9.6   5.5   1.3   8.4   9.8   9.0   1.0     9.2      2.6
5. Moby Dick             4.2   2.1   1.0   4.1   5.2   4.4   1.0     4.4      1.4
6. Maraschino            6.4   8.9   7.1   3.3   1.9   2.0   1.0     4.4      3.7
...
6817. MacArthur’s Park   1.2   1.5   5.1   3.4   1.1   1.7   9.9     1.8      5.5

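Which questions are informative? Which can be used to predict class?
The responses to which questions are correlated with class?
[Figure: response scale from 1 to 10, with the difference in class means Δμ marked]
Correlation of question with class:
$$\text{Correlation} = \frac{\Delta\mu}{\sigma_{\text{Tribble}} + \sigma_{\text{Trog}}}$$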

Which questions are informative? Which can be used to predict class?
Strategy
• Calculate correlation for each question
• Look for questions with largest correlations with class
Implementation
$$\text{Correlation} = \frac{\Delta\mu}{\sigma_{\text{Tribble}} + \sigma_{\text{Trog}}}, \qquad \mu = \frac{\sum s}{N}$$

Which questions are informative? Which can be used to predict class?
Implementation
$$\text{Correlation} = \frac{\Delta\mu}{\sigma_{\text{Tribble}} + \sigma_{\text{Trog}}}, \qquad \sigma^2 = \frac{\sum (s - \mu)^2}{N - 1}, \qquad \sigma = \sqrt{\sigma^2}$$

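Which questions are informative? Which can be used to predict class?
Implementation
$$\text{Correlation} = \frac{\Delta\mu}{\sigma_{\text{Tribble}} + \sigma_{\text{Trog}}} = \frac{\dfrac{\sum s_{\text{Tribble}}}{N_{\text{Tribble}}} - \dfrac{\sum s_{\text{Trog}}}{N_{\text{Trog}}}}{\sqrt{\dfrac{\sum (s_{\text{Tribble}} - \mu_{\text{Tribble}})^2}{N_{\text{Tribble}} - 1}} + \sqrt{\dfrac{\sum (s_{\text{Trog}} - \mu_{\text{Trog}})^2}{N_{\text{Trog}} - 1}}}$$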

Which questions are informative? Which can be used to predict class?
Implementation (for one question)
    Read_Responses_To_Question();
    $numerator   = Mean(@tribble_scores) - Mean(@trog_scores);     # Δμ
    $denominator = StDev(@tribble_scores) + StDev(@trog_scores);   # σ + σ
    $correlation = $numerator / $denominator;
    push @question_info, [$question_number, $correlation];

Which questions are informative? Which can be used to predict class?
Implementation (for every question in the input)
    while (<INPUT>) {
        Read_Responses_To_Question();
        $numerator   = Mean(@tribble_scores) - Mean(@trog_scores);     # Δμ
        $denominator = StDev(@tribble_scores) + StDev(@trog_scores);   # σ + σ
        $correlation = $numerator / $denominator;
        push @question_info, [$question_number, $correlation];
    }
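The loop relies on Read_Responses_To_Question to fill in the question number and the two score lists, but the slides do not show that routine. The following is a minimal sketch of one way to do that step. It assumes each input line holds a question number followed by one score per individual, separated by whitespace, and that a parallel list of class labels is already known. The name Split_Responses_By_Class, the line format, and the labels are all hypothetical; only the scores (the Broccoli row) come from the table.

```perl
use strict;
use warnings;

# HYPOTHETICAL class labels, one per individual, in column order.
# The slides do not say which individuals are Tribbles or Trogs.
my @class = qw(tribble trog trog tribble trog tribble trog);

# Split one input line into a question number and per-class score lists.
# ASSUMED line format: question number, then one score per individual,
# separated by whitespace.
sub Split_Responses_By_Class {
    my ($line, @labels) = @_;
    my ($question_number, @scores) = split ' ', $line;
    my (@tribble_scores, @trog_scores);
    for my $i (0 .. $#scores) {
        if ($labels[$i] eq 'tribble') {
            push @tribble_scores, $scores[$i];
        }
        else {
            push @trog_scores, $scores[$i];
        }
    }
    return ($question_number, \@tribble_scores, \@trog_scores);
}

# Broccoli row from the response table, with the made-up labels above.
my ($q, $tribbles, $trogs) =
    Split_Responses_By_Class("1 9.2 1.6 4.0 5.2 2.2 9.1 1.0", @class);
print "Question $q: Tribbles (@$tribbles), Trogs (@$trogs)\n";
```

Returning array references keeps the two score lists separate, so they can be passed on to Mean and StDev as `@$tribbles` and `@$trogs`.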

Which questions are informative? Which can be used to predict class?
Implementation
    sub Mean {
        my @scores = @_;     # Grab Tribble or Trog scores
        my $s_sum = 0;       # Start Σ at 0
        my $N = 0;           # Need to count N
        foreach my $score (@scores) {
            $s_sum = $s_sum + $score;
            $N = $N + 1;
        }
        return $s_sum / $N;  # mean = (Σ s) / N
    }
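The slides show Mean but not the standard deviation routine that the denominator needs. A minimal sketch of a companion subroutine, written in the same style and following σ² = Σ (s - μ)² / (N - 1), might look like this. The name StDev matches the call in the loop; the body and the sample numbers in the usage line are illustrative assumptions, not the original implementation.

```perl
use strict;
use warnings;

# Sample standard deviation: sqrt( Σ (s - μ)² / (N - 1) ).
sub StDev {
    my @scores = @_;                 # Grab Tribble or Trog scores
    my $N = scalar @scores;          # Number of scores
    die "StDev needs at least two scores\n" if $N < 2;

    my $s_sum = 0;                   # Start Σ s at 0
    foreach my $score (@scores) {
        $s_sum = $s_sum + $score;
    }
    my $mean = $s_sum / $N;          # μ = (Σ s) / N

    my $sq_sum = 0;                  # Start Σ (s - μ)² at 0
    foreach my $score (@scores) {
        $sq_sum = $sq_sum + ($score - $mean) ** 2;
    }
    return sqrt($sq_sum / ($N - 1)); # σ = sqrt(σ²)
}

# Arbitrary example numbers (not from the questionnaire).
printf "%.3f\n", StDev(2, 4, 4, 4, 5, 5, 7, 9);   # prints 2.138
```

Dividing by N - 1 rather than N follows the formula on the slides. It also means a class with a single member has no defined standard deviation, which is why the sketch stops early in that case.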

Which questions are informative? Which can be used to predict class?
Results

Question   Correlation
3497       1.76
281        1.72
1114       1.71
…          …

Are these questions good predictors of class?
Suppose there are NO good predictors of class…
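A results table like this one can be produced by sorting the [question number, correlation] pairs collected in @question_info. The sketch below assumes that data layout. The helper name Rank_Questions is hypothetical, and the entry for question 12 is made-up filler to show that low scorers sink to the bottom; the other three pairs are the ones on the slide.

```perl
use strict;
use warnings;

# Sort [question_number, correlation] pairs, largest correlation first.
sub Rank_Questions {
    my @info = @_;
    my @ranked = sort { $b->[1] <=> $a->[1] } @info;
    return @ranked;
}

# Three pairs from the slide plus one HYPOTHETICAL low scorer (question 12).
my @question_info = ([1114, 1.71], [12, -0.20], [3497, 1.76], [281, 1.72]);

print "Question\tCorrelation\n";
foreach my $entry (Rank_Questions(@question_info)) {
    my ($question_number, $correlation) = @$entry;
    printf "%d\t%.2f\n", $question_number, $correlation;
}
```

Sorting on the raw value follows the slides' wording ("largest correlations"). Sorting on abs() instead would also surface questions where Trogs score higher than Tribbles; that is a variation, not something the slides state.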

(Interlude)
NEWS! A precinct in Harrisonburg has voted for the winning senatorial candidate every time for the past ten elections!
Probability if by chance = (1/2) · … = $(1/2)^{10} = 1/1024 \approx 1/1000$
Suppose there are 1000 precincts in Virginia…
(BLAST from the past) E = (probability) · (number of combinations)
Beware the fallacy of the unlikely result!
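The arithmetic behind the warning can be made explicit. This is a sketch that assumes each precinct matches the winner with probability 1/2 in every election, independently of all the others (an idealization the slide does not state), and takes the "number of combinations" to be the 1000 precincts:

```latex
\begin{align*}
p &= \left(\tfrac{1}{2}\right)^{10} = \tfrac{1}{1024} \\
E &= p \cdot 1000 = \tfrac{1000}{1024} \approx 0.98 \\
P(\text{at least one such precinct}) &= 1 - \left(1 - \tfrac{1}{1024}\right)^{1000} \approx 0.62
\end{align*}
```

Under these assumptions, about one precinct with a perfect record is expected by chance alone, so finding one is not surprising.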

Which questions are informative? Which can be used to predict class?
Are these questions good predictors of class?
Suppose there are NO good predictors of class… what would be the expected correlation?

Which questions are informative? How to test class predictors?
Choice #1: Rerun time with the different (?) reality that Tribbles are no different from Trogs
Choice #2: Use random data
???

Random responses to questionnaire

                          T1      T2      T3     T4      T5     T6    T7   ...
1. Broccoli              9.2   -1600   33 1/3    99   3.14159   -0   1.0   ...
2. Floss
3. Ballet
4. Pair socks
5. Moby Dick
6. Maraschino
...
6817. MacArthur’s Park

Random doesn’t mean crazy

Random responses to questionnaire
[Table: values in the same 1 to 10 range as the actual responses]
Maybe but…

Random responses to questionnaire
[Table: the actual response values]
Keep the data, shuffle the players

Which questions are informative? How to test class predictors?
Choice #1: Rerun time with the different (?) reality that Tribbles are no different from Trogs
Choice #2: Use random data
Choice #3: Shuffle data
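Choice #3 can be sketched as follows: keep every score exactly as recorded, but reassign the Tribble/Trog labels at random and recompute the correlation. The code below is a minimal illustration, not the original implementation. It uses `shuffle` and `sum` from the core List::Util module, the class labels are hypothetical, and the scores are the Broccoli row from the table.

```perl
use strict;
use warnings;
use List::Util qw(shuffle sum);

sub Mean {
    return sum(@_) / scalar(@_);
}

sub StDev {
    my $mean = Mean(@_);
    my $sq_sum = sum(map { ($_ - $mean) ** 2 } @_);
    return sqrt($sq_sum / (scalar(@_) - 1));
}

# Correlation of one question with class, given the scores and a parallel
# list of labels. Each class needs at least two members, and the two
# standard deviations must not both be zero.
sub Correlation {
    my ($scores, $labels) = @_;
    my (@tribble_scores, @trog_scores);
    for my $i (0 .. $#$scores) {
        if ($labels->[$i] eq 'tribble') {
            push @tribble_scores, $scores->[$i];
        }
        else {
            push @trog_scores, $scores->[$i];
        }
    }
    my $numerator   = Mean(@tribble_scores) - Mean(@trog_scores);
    my $denominator = StDev(@tribble_scores) + StDev(@trog_scores);
    return $numerator / $denominator;
}

my @scores = (9.2, 1.6, 4.0, 5.2, 2.2, 9.1, 1.0);             # Broccoli row
my @labels = qw(tribble trog trog tribble trog tribble trog); # HYPOTHETICAL

# Keep the data, shuffle the players: same scores, permuted labels.
my @shuffled_labels = shuffle(@labels);

printf "actual labels:   %.2f\n", Correlation(\@scores, \@labels);
printf "shuffled labels: %.2f\n", Correlation(\@scores, \@shuffled_labels);
```

Shuffling the labels rather than generating new numbers keeps the distribution of scores and the class sizes unchanged, so any correlation that survives the shuffle is due to chance alone.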

Which questions are informative? How to test class predictors?
[Figure: # of questions with better correlations (y-axis, 0 to 10000) vs. Correlation (x-axis, 2.0 down to -0.5); curve: 5% of shuffled responses]

Which questions are informative? How to test class predictors?
[Figure: # of questions with better correlations (y-axis, 0 to 10000) vs. Correlation (x-axis, 2.0 down to -0.5); curves: Actual responses, 1% of shuffled responses]

Which questions are informative? How to test class predictors?
[Figure: # of questions with better correlations (y-axis, 0 to 10000) vs. Correlation (x-axis, 2.0 down to -0.5); curves: Actual responses, 1% of shuffled responses; annotations: "If class predictors don’t work", "If class predictors are valid"]
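The quantity on the vertical axis, the number of questions with a better correlation than a given value, can be tallied with a short routine. The sketch below assumes "better" means greater than or equal to the threshold. The helper name Count_Better is hypothetical, and apart from the three top correlations reported in the results, every number in both lists is made up purely for illustration.

```perl
use strict;
use warnings;

# Number of questions whose correlation is at least $threshold.
sub Count_Better {
    my ($threshold, @correlations) = @_;
    return scalar grep { $_ >= $threshold } @correlations;
}

# ILLUSTRATIVE values only: the first three are from the results slide,
# the rest (and all shuffled values) are invented for this example.
my @actual   = (1.76, 1.72, 1.71, 0.9, 0.4, 0.1, -0.2);
my @shuffled = (0.8, 0.5, 0.3, 0.1, 0.0, -0.1, -0.3);

print "Threshold  Actual  Shuffled\n";
foreach my $threshold (1.5, 1.0, 0.5, 0.0) {
    printf "%9.1f  %6d  %8d\n",
        $threshold,
        Count_Better($threshold, @actual),
        Count_Better($threshold, @shuffled);
}
```

If the actual responses consistently yield more questions above a threshold than the shuffled responses do, that supports the predictors being valid; if the two counts track each other, the predictors are no better than chance.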