Stat E-100 Lecture 1: Introduction to Data
Ethan Fosse, Ph.D.

Teaching Staff
• Instructor: Ethan Fosse, Ph.D.
• Email: efosse@fas.harvard.edu
• Head Teaching Assistant: Mark Ouchida, A.L.M., M.Ed.
• Email: mark_ouchida@harvard.edu
• Teaching Assistants: Regan Bernhard, Rohit Goyal, Kela Roberts
• See course website for contact information

Course Textbook
• OpenIntro Statistics, 3rd Edition
• Available free online at: https://www.openintro.org/stat/textbook.php
• May also be purchased on Amazon for about $10
• My philosophy: Textbooks should be high quality but inexpensive

Assigned TA and Sections
• You will be assigned a teaching assistant (TA) next week
• Your TA is your first point of contact for questions about the problem sets, key ideas, and statistical concepts
• We will offer several sections that will cover the course material in more depth
• You may attend any (or all) sections

Course Grade for Undergraduates
• Regularly assigned problem sets that will count for 30% of your grade
• A midterm exam that will count for 30% of your grade
• A final exam that will count for 40% of your grade
• See syllabus for the course grade for graduate students (an additional final project is required)

Problem Sets
• Problem sets will be posted online
• They will be based on problems in the OpenIntro Statistics textbook
• There are ten problem sets throughout the semester
• The first problem set is due on September 17th and will be posted online next week

Midterm and Final Exams
• The midterm exam is given on October 22nd and the final exam on December 17th
• Both exams are given entirely online through the course website
• Each exam will last 2 hours and can be taken at any point during a 24-hour time period

Midterm and Final Exams (cont.)
• Each exam consists of three sections: first, a set of multiple choice questions; second, a set of numerical answer questions; finally, several long answer questions
• Both exams are entirely “open book” in that you can use any textbook, notes, problem sets, or lecture slides to help you answer the questions

Statistical Programming Languages
• Excel: popular among Wall Street types and business school grads (has limited programming functionality)
• MATLAB: often used by neuroscientists and engineers
• SPSS: widely used by psychologists but has limited functionality
• SAS: popular among biologists and some biostatisticians
• Python: popular among computer scientists; somewhat limited statistics functionality but improving
• Stata: popular among economists and financial analysts; also used by biostatisticians
• R: used by statisticians, biostatisticians, and cool people; often used with RStudio

R is Extremely Popular!

Downloading and Installing RStudio
• Where to download RStudio: http://www.rstudio.com/products/rstudio/download/
• How are R and RStudio related?
• R is the underlying programming language we use, while RStudio provides a graphical front-end with a number of useful features, including syntax highlighting and windowing
• For this course we use RStudio with R, but there are other graphical front-ends that some people use with R

Supplementary Texts for R
• For learning R, we recommend the Five College Guide to R and Start Teaching with R
• Both are posted on the course website
• R will be taught in the course sections, based on the material covered in these texts, as well as during lecture

Key R Commands for Each Lecture
• For each lecture we will post a handout of the key R commands used
• These commands will be reviewed during the sections
• Most of the commands will be based on the R package “mosaic”

Using the Mosaic Package
• To install the mosaic package: install.packages("mosaic")
• To load the mosaic package when you start your R session: library("mosaic")

Tentative Course Map Introduction to Data Categorical Data Numerical Data Descriptive Statistics Probability Tables and Relative Risk Correlation Analysis Simple Linear Regression Basics of Sampling Distribution Tests for Means Tests for Contingency Tables and ANOVA Inferences for Correlation and Regression Course Review Tests for Proportions

Myths About Statistics

Four Misconceptions About Statistics
There are several, but I will begin with four:
1. Statistics is mathematics
2. Statistics is dry and dull (and definitely not “sexy”)
3. Statistics is irrelevant to the real world
4. Statistics is not really that useful except for research scientists
These are completely incorrect. Let’s review these!

Myth 1: Statistics is Mathematics

Reminder: Statistics is Not Mathematics!
• Mathematics is better understood as a tool statisticians use, but it’s not the only skill or tool
• Statisticians often use mathematics because it can clarify what’s going on, but the biggest issues and problems are often conceptual
• We use mathematics in statistics, but we also use computer programming, graphic design, substantive expertise, and creativity, just to name a few other things!

Why No Precocious Statisticians?
The statistician M. G. Kendall on The Future of Statistics (1968):
“…although musicians are often precocious, poets never are. One can draw the same kind of distinction between mathematicians, who are usually precocious, and statisticians who, as statisticians, are not. There is a certain apprenticeship in handling real-life situations to be served before an individual is mature enough to tackle important statistical problems.”

Myth 2: Statistics is Dry or Dull

Data Everywhere
(1) Exponential growth and availability of all sorts of data (“Internet of Things”)
(2) Huge gap between exponential growth of data and number of people who can analyze these data
(3) Need for analysts who can “tell stories” or “give meaning”
• This is why statistics is not equivalent to math
• Bottom line: we have more data than people to analyze them!

The World Needs You!
The Amount of Data is Growing Exponentially
At no other time have we more urgently needed careful data analysts and thinkers such as yourself to find the hidden facts and patterns that will change our world!

Statistics as “Data Science”

What is “Data Science”?
• Data science is a relatively recent term, coined in the 1970s
• 1997: the statistician C. F. Jeff Wu elaborated on the phrase in a talk titled “Statistics = Data Science?”
• He argued that statisticians should rebrand themselves as “data scientists” since they spend most of their time manipulating and experimenting with data
• Until recently, data scientists have tended to focus more on cleaning datasets and programming
• My own view: “data science” = “applied statistics”

Data Science Venn Diagram
Annotations on the diagram:
• Statistical problems and concepts!
• Stata and R are programming languages
• From medicine, biology, or your own experience! “From what I know about the world, smoking is related to cancer.”
• Stat 102 = “Data Science”!
• Upcoming regression project will help us delve deeper into the middle of this Venn Diagram

To Recap: Statistics as “Data Science”
• Statistics has been around for a relatively long time
• “Data Science” is a new buzzword reflecting the huge increase in data over the past several decades
• No clear-cut definition
• My own view: “data science” can be understood as “applied statistics”
• Not just theory, but practical application is important!

Myth 3: Statistics is Irrelevant

Statistics is Everywhere
• Statistics is used in every field today
• With statistics we can gain new knowledge about everything from healthy eating to the fundamental particles of the Universe
• Statistics is also foundational to constructing and building things, from skyscrapers to airplanes to websites
• It is also in extremely high demand

Example 1: Best Jobs of 2014

Example 2: High Demand for Data Analysts

I picked these examples from the news in a single week. Articles like these appear all the time. Statistics is in very high demand!

Myth 4: Statistics is Only for Scientists

Interpreting Test Results

Importance of Statistical Literacy

Many Other Fields Use Statistics!
The applications are endless: sociology, anthropology (including cultural anthropology), political science, psychology, and so forth! We will cover these applications through examples over the semester.

Four Misconceptions About Statistics Revisited
There are several, but I will begin with four:
1. Statistics is creativity, programming, communication, personal experience, graphic design – and mathematics
2. Statistics is “sexy” and “data science”
3. Statistics is not only relevant, but essential
4. Statistics is incredibly useful for not only scientists but medical doctors, anthropologists, historians (and others)

Overview Lecture 1: Introduction to Data

Course Map Introduction to Data Categorical Data Numerical Data Descriptive Statistics Probability Tables and Relative Risk Correlation Analysis Simple Linear Regression Basics of Sampling Distribution Tests for Means Tests for Contingency Tables and ANOVA Inferences for Correlation and Regression Course Review Tests for Proportions

Overview of Week 1: Introduction to Data
Part 1: Types of Data
Part 2: Samples and Populations
Part 3: Basics of Study Design

Part 1 Types of Data

What is Statistics?
• Statistics is the study of how best to collect, analyze, and draw conclusions from data
• Modern statistics as a field developed only in the early 20th century, but its foundations were laid much earlier
• Statistics can be viewed in the context of the general process of scientific investigation

Process of Investigation
1. Identify a question or problem
2. Collect relevant data on the topic
3. Analyze the data
4. Make a conclusion
Example: Will aspirin help stop my headache?

Everyday Examples
• Does working out in the morning make me more energized?
• Will eating a watermelon relieve my thirst?
• If I study one hour a day, can I ace the final exam?
• Can I lose weight if I just avoid cream with my coffee in the morning?
We do this all the time!

Goal of Statistics
1. Identify a question or problem
2. Collect relevant data on the topic
3. Analyze the data
4. Make a conclusion
Statistics is focused on making steps 2 to 4 more rigorous, reproducible, and reliable

Substantive Focus for this Course
1. Identify a question or problem
2. Collect relevant data on the topic
3. Analyze the data
4. Make a conclusion
We will (in general) focus on questions or problems related to the social sciences and humanities

What are Data?
• Data are a set of observations
• These observations can be collected in various ways, such as through field notes, surveys, and experiments
• Often (but not always) in the social sciences and humanities, the focus is on collecting data on people

Data or Datum?
• The singular form is “datum,” so we say “that datum is very low”
• “Data” is the plural form, but it may be used with a singular or plural verb
• Both are grammatically correct:
  • “The data were analyzed after they were collected.”
  • “The data was analyzed after it was collected.”

Real-World Examples
• Do kids think sports are important for being popular?
• Were men more likely to die on the Titanic than women?
• Could we have predicted who won the 2008 presidential election in the United States?
We will explore these kinds of questions in this course!

Obama vs. McCain in 2008
Source: Wikipedia

2008 Election Results Source: Wikipedia

Back in October 2008
• Suppose you were a political analyst in October 2008 who wanted to predict the 2008 election
• You’ve had an assistant collect data on the characteristics of U.S. states
• How did your assistant organize the data so you could analyze it?

Organizing Data
• Data are usually organized into a dataset (or dataframe)
• The rows of the dataset are called cases (or observational units)
• The columns of the dataset are called variables
• The variables are characteristics or traits of the cases
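As a rough illustration of this layout, here is a minimal sketch in base R. The column names follow the variables used in this lecture (State, Region, Population, HighSchool). Apart from Alabama's population of 4.53 million, every number is a made-up placeholder, not a real figure.

```r
# Toy dataset: each row is a case (a U.S. state), each column is a variable.
# NOTE: apart from Alabama's population (4.53 million, from the lecture),
# the numbers below are illustrative placeholders, not real data.
states <- data.frame(
  State      = c("Alabama", "Alaska", "Arizona"),
  Region     = c("South", "West", "West"),
  Population = c(4.53, 0.7, 6.0),   # in millions
  HighSchool = c(80, 90, 85),       # percent who graduated high school
  stringsAsFactors = FALSE
)

nrow(states)    # number of cases (rows)
ncol(states)    # number of variables (columns)
names(states)   # variable names
states[states$State == "Alabama", ]   # look up a single case
```

Each call to nrow() or ncol() simply counts rows or columns, which is a quick way to check how many cases and variables a dataset holds.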

Dataset Organization
• Cases (U.S. states)
• Variables (characteristics of U.S. states)
…

Keeping Track of Cases • … …

How Many Cases? • … …

Many Different Kinds of Cases
• Cases (that is, the rows of the dataset) need not be U.S. states
• Other examples:
  • Respondents who answer a survey
  • Subjects or respondents involved in an experiment
  • Records of transactions or objects
  • Different historical events

Each column is a different variable or characteristic of the cases:
…
• State: Name of the U.S. state
• Region: Geographic region of the U.S. state
• Population: Population (in millions)
• HighSchool: Percent who graduated high school

Be Careful of the Units of Measurement
…
Example: What is the population of Alabama? About 4.53 million people, not 4.53 people!
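A small sketch of the unit conversion in base R, assuming the Population variable is stored in millions as described earlier in this lecture. The variable names are hypothetical.

```r
# Population is recorded in millions, so rescale before reporting a head count.
population_millions <- 4.53                      # Alabama, as given in the lecture
population_people   <- population_millions * 1e6

format(population_people, big.mark = ",")        # print with thousands separators
```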

Many Types of Variables
In general, variables can be considered either numerical or categorical
All Variables
• Numerical
  • Continuous
  • Discrete
• Categorical
  • Nominal
  • Ordinal

Numerical Variables
• Numerical variables have values that are numbers
• For numerical variables, it usually makes sense to add, subtract, or take averages with those values
• Examples: Age of respondent, miles per gallon in a car, ranking of colleges

Continuous vs. Discrete •

Continuous or Discrete? •

Categorical Variables
• Categorical variables have values that are different categories (or qualities)
• Sometimes a categorical variable is called a factor and the different categories are called levels
• Examples: Gender of respondent, approval or disapproval of a social policy, regular smoker or not

Nominal vs. Ordinal
• Nominal variables have no inherent order to their categories
  • Examples: Religious identification, political party affiliation, names of countries visited
• Ordinal variables usually have an expected ordering to their categories
  • Examples: Highest educational degree attained, level of approval of a country’s economic policy (from “Strongly Disapprove” to “Strongly Approve”)
• Not always clear-cut!
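The distinction can be sketched with base R factors. The responses are invented for illustration, and the two middle approval levels ("Disapprove" and "Approve") are an assumption, since the slide gives only the endpoints of the scale.

```r
# Nominal variable: categories with no inherent order.
party <- factor(c("Democrat", "Republican", "Independent", "Democrat"))

# Ordinal variable: categories with an expected order, declared explicitly.
# The two middle levels are assumed; the lecture names only the endpoints.
approval_levels <- c("Strongly Disapprove", "Disapprove",
                     "Approve", "Strongly Approve")
approval <- factor(c("Approve", "Strongly Disapprove", "Strongly Approve"),
                   levels = approval_levels, ordered = TRUE)

levels(party)          # the categories (levels) of the nominal factor
is.ordered(party)      # FALSE: R treats this factor as nominal
is.ordered(approval)   # TRUE: R keeps track of the ordering
approval < "Approve"   # order comparisons only make sense for ordinal data
```

Declaring the order up front matters because R otherwise sorts levels alphabetically, which would put "Approve" before "Strongly Disapprove".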

Numerical vs. Categorical
For this course we will focus mainly on distinguishing between numerical and categorical variables
All Variables
• Numerical
  • Continuous
  • Discrete
• Categorical
  • Nominal
  • Ordinal

Our Dataset Revisited
• Cases (U.S. states)
• Numerical and Categorical Variables (characteristics of U.S. states)
…

What Type of Variable?
…
State: Name of the U.S. state
• Categorical (nominal)
• Categories (or levels) are “Alabama,” “Alaska,” “Arizona,” … “Wyoming”

What Type of Variable?
…
Region: Geographic region of the U.S. state
• Categorical (nominal)
• Categories (or levels) are “Northeast,” “Midwest,” “West,” “South”

What Type of Variable? … •

Summary:
• Cases (U.S. states)
• Categorical (Nominal)
• Numerical (Continuous)
…

Things to Know
• Data are a set of observations about the world around us
• To prepare for data analysis, we organize our data into a dataset
• The rows of a dataset are the cases and the columns are variables
• Variables may be numerical or categorical

Part 2 Samples and Populations

Research Questions
• What is the average number of words spoken daily among 10-year-old schoolchildren in New York City?
• Do citizens of the United States approve or disapprove of legalizing marijuana use?
• Does meditating for 15 minutes each day improve happiness among college students in Beijing?

Defining a Population
• Each research question refers to a target population
• A population is the entire collection of cases about which information is desired
• For example, if we want to know the percentage of Buddhists in Japan, the population is all people in Japan

Name the Population
• Research Question: What is the average number of words spoken daily among 10-year-old schoolchildren in New York City?
• Target population: All 10-year-old children who attend school in New York City

Name the Population
• Research Question: Do citizens of the United States approve or disapprove of legalizing marijuana use?
• Target population: All citizens of the United States

Name the Population
• Research Question: Does meditating for 15 minutes each day improve happiness among college students in Beijing?
• Target population: All college students in Beijing

Conducting a Census
• Collecting data on the entire population is called conducting a census
• Censuses have been conducted throughout history for purposes of empire-building:
  • Taxation of citizens
  • Aid reproduction of population
  • Recruit able-bodied men for war

Example: Census of Turin, Italy
• The French were attacking Turin, Italy in 1705
• In part to recruit men for war, city officials surveyed all inhabitants
• Variables included name, age, gender, birthplace, and weapons kept in the household

Turin, Italy 1705 Census
• Cases (Residents of Turin, Italy)
• Numerical and Categorical Variables (characteristics of residents)
…

Problems with a Census
• Often very difficult and expensive to conduct
• Hard-to-reach groups such as undocumented migrants and the homeless can increase the cost further
• Ethical concerns about requiring everyone to participate in the census
• May constitute a violation of one’s privacy

Solution: Collect a Sample
Rather than conducting a census, we can collect a sample, which is a subset of the population.
Diagram labels: Population, Sample
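One way to picture a sample as a subset of a population is a simple random draw in base R. This is only a sketch: the population here is a hypothetical list of 10,000 case IDs, and the sample size of 50 and the seed are arbitrary choices.

```r
set.seed(100)   # arbitrary seed, used only so the draw is reproducible

population <- 1:10000                        # hypothetical case IDs
sample_ids <- sample(population, size = 50)  # simple random sample, no replacement

length(sample_ids)                # sample size
all(sample_ids %in% population)   # every sampled case belongs to the population
```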

We Take Samples All the Time
• Suppose you’re at a giant buffet with hundreds of different entrees
• Most people will try out some subset of the entrees before making a conclusion about what to eat
• People also sample every day when buying things, listening to music, and making decisions

From Samples to Populations
• Because samples are taken so frequently, we often distinguish between descriptive and inferential statistics
• Descriptive statistics is organizing and presenting data from a sample or population
• Inferential statistics is making conclusions about a population based only on data from a sample

Descriptive Statistics
• Presenting data
  • For example, tables and graphs
• Summarizing data
  • For example, average height of respondents in a sample
  • For example, percentage of men in a country based on Census data
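A minimal sketch of these two activities in base R, using an invented sample of five respondents (the heights and genders are illustrative values only).

```r
# Made-up sample of five respondents (illustrative values only).
respondents <- data.frame(
  height = c(62, 65, 68, 70, 75),        # inches
  gender = c("F", "M", "F", "F", "M")
)

mean(respondents$height)                      # summarizing: average height
table(respondents$gender)                     # presenting: counts per category
prop.table(table(respondents$gender)) * 100   # percentage in each category
```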

Inferential Statistics
• Estimating
  • For example, using the average height from a sample to estimate the average height in a population
• Hypothesis Testing
  • For example, using a sample of data to test the claim that the average height in a population is 72 inches
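As a preview of both ideas, here is a hedged sketch using base R's t.test() on simulated heights. The data are generated, not measured, and the one-sample t-test is just one possible way to test the 72-inch claim; tests for means are covered later in the course.

```r
set.seed(72)   # arbitrary seed, for reproducibility only

# Simulated sample of 40 heights in inches (not real measurements).
heights <- rnorm(40, mean = 70, sd = 3)

# Estimating: the sample mean serves as our estimate of the population mean.
mean(heights)

# Hypothesis testing: one-sample t-test of the claim that the
# population mean height is 72 inches.
result <- t.test(heights, mu = 72)
result$estimate   # the sample mean used by the test
result$p.value    # small values count as evidence against the claim
```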

Sample Statistics and Population Parameters
• A statistic is a numerical summary based on a sample
  • For example, the average weight in a sample of hospital patients in London
• A parameter is a numerical summary of a population
  • For example, the average weight of all hospital patients in London
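The difference can be sketched with a simulated population in base R. The 5,000 patient weights below are generated at random and merely stand in for "all hospital patients"; in practice the parameter is usually unknown, which is why we rely on the statistic.

```r
set.seed(2014)   # arbitrary seed, for reproducibility only

# Simulated "population": weights (kg) of 5,000 hypothetical patients.
population_weights <- rnorm(5000, mean = 75, sd = 15)

parameter <- mean(population_weights)               # population parameter
patient_sample <- sample(population_weights, 100)   # sample of 100 patients
statistic <- mean(patient_sample)                   # sample statistic

c(parameter = parameter, statistic = statistic)     # close, but not identical
```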

Sample Size versus Population Size •

A Note on Notation •

Research Questions Revisited
• Research Question: What is the average number of words spoken daily among 10-year-old schoolchildren in New York City?
• “The news had a segment on how young teenagers spend most of their time text messaging rather than talking, so the average number of words spoken each day can’t be more than 500.”

Questions Revisited
• Research Question: Do citizens of the United States approve or disapprove of legalizing marijuana use?
• “I met an elderly couple from Oklahoma who really dislike marijuana, so clearly it’s widely disapproved of by people in the United States.”

Questions Revisited
• Research Question: Does meditating for 15 minutes each day improve happiness among college students in Beijing?
• “I started meditating for 30 minutes each day and I’m happier, so college students who meditate must be happier, too.”

Anecdotal Evidence
• These are all examples of anecdotal evidence, based on a limited sample that might not be representative of the population
• Anecdotal evidence is typically composed of unusual cases that we recall because of their striking characteristics
• Instead of relying on anecdotes, we should sample appropriately from the population or conduct a census

Things to Know
• A population is the entire collection of cases about which information is desired
• A sample is a subset of a population
• We often distinguish between descriptive and inferential statistics
• Anecdotal data is highly unreliable

Part 3 Basics of Study Design

“To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.” – R. A. Fisher

Process of Investigation
1. Identify a question or problem
2. Collect relevant data on the topic
3. Analyze the data
4. Make a conclusion
We collect data by designing a statistical study!

Types of Statistical Studies
Statistical Studies
• Observational
  • Cross-sectional
  • Longitudinal (or Panel)
• Experimental
Definitions:
• Observational studies entail observation and measurement, but no intervention
• Experimental studies entail the application of some intervention

Observational Studies
• We typically cannot prove cause and effect with observational studies
• A survey is a typical observational study
• In a cross-sectional study, data are collected at one point in time on a set of individuals
• In a longitudinal (or panel) study, data are collected over time on the same individuals

Example: Obesity Epidemic
• Tracking the same people over time from 1971 to 2003, social scientists used survey data to examine the spread of obesity through social networks
• The authors conclude: “Our study suggests that obesity may spread in social networks in a quantifiable and discernable pattern.”
• This is a longitudinal observational study

Be Careful with Interpreting Results
• In an observational study, there can always be confounding (or lurking) variables affecting the results
• This means that observational studies can almost never show causation
• It is easier to adjust for confounding variables in an experiment

Confounding Variables
• A confounding variable is a variable not included in the study design that has an effect on the variables studied
• For example, if you study only sunscreen and cancer, you might conclude that sunscreen is causing cancer (when in fact you’re omitting the level of sun exposure)
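The sunscreen example can be illustrated with a small simulation in base R. This is a sketch under invented assumptions: all probabilities below are made up, and sunscreen is given no effect on cancer at all, so any crude association comes entirely from sun exposure.

```r
set.seed(1)    # arbitrary seed, for reproducibility only
n <- 100000    # number of simulated people

# Confounder: half of the simulated people have high sun exposure.
high_sun <- rbinom(n, 1, 0.5) == 1

# High exposure makes sunscreen use more likely (assumed probabilities).
sunscreen <- rbinom(n, 1, ifelse(high_sun, 0.8, 0.2)) == 1

# Cancer risk depends ONLY on sun exposure, not on sunscreen (assumed risks).
cancer <- rbinom(n, 1, ifelse(high_sun, 0.10, 0.02)) == 1

# Ignoring sun exposure, sunscreen users appear to have more cancer.
crude_diff <- mean(cancer[sunscreen]) - mean(cancer[!sunscreen])

# Comparing people with the same exposure, the difference nearly vanishes.
diff_high <- mean(cancer[sunscreen & high_sun]) - mean(cancer[!sunscreen & high_sun])
diff_low  <- mean(cancer[sunscreen & !high_sun]) - mean(cancer[!sunscreen & !high_sun])

round(c(crude = crude_diff, high_exposure = diff_high, low_exposure = diff_low), 3)
```

Splitting the comparison by exposure level is a simple form of adjusting for the confounder; it is only possible here because the simulation records sun exposure.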

Experimental Studies
• Experimental studies entail the application of some intervention (or treatment)
• Often used in psychology and medicine
• Individuals or subjects are randomly assigned to treatment and control groups
• Try to remove known effect of confounders by controlling the environmental conditions
• Accordingly called randomized controlled trials
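Random assignment can be sketched in a few lines of base R. The twenty subject IDs and the even 10/10 split are hypothetical choices made for illustration.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

subjects <- paste0("subject_", 1:20)   # hypothetical subject IDs

# Shuffle ten "treatment" and ten "control" labels at random.
group <- sample(rep(c("treatment", "control"), each = 10))

assignment <- data.frame(subject = subjects, group = group)
table(assignment$group)   # confirms ten subjects in each group
```

Because chance alone decides who gets the treatment, confounders such as age or health tend to balance out across the two groups.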

Example: Neighborhoods and Obesity
• Women and children living in poor neighborhoods were randomly assigned to have an opportunity to live in wealthier neighborhoods
• Conclusion: “The opportunity to move from a neighborhood with a high level of poverty to one with a lower level of poverty was associated with modest but potentially important reductions in the prevalence of extreme obesity and diabetes.”

Disadvantages of Experiments
• They can be unethical if conducted on vulnerable populations or the treatment is known to be harmful
• It can be difficult to monitor subjects to ensure that they are doing what they are told
• Results of experiments that use animals do not necessarily generalize to humans
• They can take many years, even decades, to complete

Things to Know
• Observational studies entail observation and measurement, but no intervention
• Experimental studies entail the application of some intervention
• Observational studies can be cross-sectional or longitudinal
• It can be difficult to say anything about causation in observational studies because of confounding variables

Course Map Introduction to Data Categorical Data Numerical Data Descriptive Statistics Probability Tables and Relative Risk Correlation Analysis Simple Linear Regression Basics of Sampling Distribution Tests for Means Tests for Contingency Tables and ANOVA Inferences for Correlation and Regression Course Review Tests for Proportions

End of Lecture 1: Introduction to Data
Stat E-100, Harvard University
Ethan Fosse, Ph.D.