From Data Visualization to Data Science: Moving from Descriptive to Predictive

- Slides: 28
From Data Visualization to Data Science: Moving from Descriptive to Predictive
Heather J. Chapman, Ph.D.
HEDW 2021 Virtual Presentation, April 15, 2021

Heather J. Chapman, Ph.D., Director of Academic Analytics
● Research, evaluation, statistics
● Both instructor and practitioner
● Predictive analytics
● Regression
● Machine Learning
● Emphasis on providing meaningful display and description of complicated results

1. Introduction
   a. What is the issue?
   b. Where are we now?
2. Descriptive v. Predictive
3. Institutional Considerations
4. Departmental Considerations
   a. Before taking on a project
   b. After delivering data

The ultimate reason for collecting and analyzing data is to make better decisions... (...but building a data-driven organization is not easy.)
What is the issue?
● The need for data and analytics is great across most institutions
  ○ Predictive analytics is a reported “top need”
    ■ 77% of institutions report need for predictive analytics at their institution or in specific departments (Educause, 2015)
    ■ Top HEDW need for at least the last 3 years
  ○ 60% of institutions report that using predictive analytics for decision making is a major challenge
  ○ Only about 15% of institutions report having enough analysts to support data and analytics (Educause, 2012)

What is the issue?
● Competition for enrollments is increasing
● Improvement in student success
● Many accrediting bodies are increasing demand for data
● Performance-based funding is becoming the new “normal”
● “Big data” and the increasing availability of data

What is the issue?
● Death by dashboard
  ○ Access to data is no longer the problem
  ○ We’ve moved from data shortage to data overkill
  ○ Dashboards have become commonplace
  ○ Data space has become cluttered
  ○ This causes confusion, mistrust
    ■ Where do I go for data?
    ■ What do the data say?
    ■ Why do these views contradict?
    ■ I’ll use this view, it shows what I want (not what is “best”)

Where are we now?
4 Analytics Disciplines
1. Descriptive
2. Diagnostic
3. Predictive
4. Prescriptive
Labels on the slide: (hindsight), (insight or foresight)
Higher education data use is typically inefficient, using only 2 of these, and focusing on hindsight to make policy decisions.

Descriptive Statistics
  Measures of Central Tendency
    - Mean, Median
    - Frequency Counts (Mode)
  Measures of Variability
    - Standard Deviation
    - Min/Max Values

Inferential Statistics
  t-Test
  ANOVA
  Correlation
  Regression*
  Machine Learning/AI*

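As an illustration of the descriptive measures listed above, the following is a minimal Python sketch using only the standard library. The `describe` helper and the `credit_loads` values are made up for this example and do not come from the presentation.

```python
import statistics


def describe(values):
    """Return the descriptive measures named on the slide for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),      # most frequent value
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


# Hypothetical credit loads for eight students (illustration data only).
credit_loads = [12, 15, 15, 9, 12, 15, 18, 15]
summary = describe(credit_loads)

for name, value in summary.items():
    print(f"{name}: {value}")
```

Everything in this summary answers "what do the data say?"; none of it says why the values look the way they do.
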
What’s the difference: Descriptive v. Predictive?
● Descriptive statistics are focused on facts
  ○ What do the data say or show?
    ■ What percentage of…
    ■ How many of…
● Predictive statistics are focused on actions
  ○ What decisions can be made?
    ■ What leads to…
    ■ What is the impact of…

Descriptive v. Predictive
DESCRIPTIVE | PREDICTIVE
How many students are admitted and enroll each year? | Which students are most likely to be admitted and enroll?
How many students retain from one year to the next? | What factors increase a student’s chances of retaining?
How many students participate in engagement activities? | What is the impact of engagement activities on graduation rates?
How many students graduate within 6 years? | Which students have the highest odds of graduating?

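To make the contrast concrete, here is a small Python sketch, offered as one possible illustration rather than the presenter's method. It computes a descriptive retention rate and then a simple odds ratio for a single factor. The record layout, the `engaged` and `retained` field names, and the counts are all hypothetical. An odds ratio from a 2x2 table only shows an association; a predictive model such as a regression would also account for other variables.

```python
def retention_rate(records):
    """Descriptive question: what share of students were retained?"""
    return sum(1 for r in records if r["retained"]) / len(records)


def odds_ratio(records, factor):
    """Odds of retention with `factor` divided by odds of retention without it.

    Raises ZeroDivisionError if any non-retained cell of the 2x2 table is empty.
    """
    with_kept = sum(1 for r in records if r[factor] and r["retained"])
    with_lost = sum(1 for r in records if r[factor] and not r["retained"])
    without_kept = sum(1 for r in records if not r[factor] and r["retained"])
    without_lost = sum(1 for r in records if not r[factor] and not r["retained"])
    return (with_kept / with_lost) / (without_kept / without_lost)


# Hypothetical cohort of 100 students (made-up counts for illustration).
records = (
    [{"engaged": True, "retained": True}] * 40
    + [{"engaged": True, "retained": False}] * 10
    + [{"engaged": False, "retained": True}] * 30
    + [{"engaged": False, "retained": False}] * 20
)

rate = retention_rate(records)          # "How many students retain?"
ratio = odds_ratio(records, "engaged")  # "What is associated with retaining?"
print(f"retention rate: {rate:.2f}, odds ratio for engagement: {ratio:.2f}")
```
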
When is it appropriate to use each?
DESCRIPTIVE | PREDICTIVE
Is the program growing? | Is the program effective?
How many people should we invite? | Are there other factors involved?
Who are we serving? | Do demographics affect achievement?
When were enrollments the greatest? | What has the most impact?

How do we move forward? Institutional Considerations
What is needed at the institutional level?
1. Decision-making culture
   a. Is your senior leadership committed to predictive analytics?
   b. Is there a general sense of cultural acceptance and use?
2. Policies
   a. Specific to data collection, access and use
   b. Are policies in place to assure the ethical use of data?
E. Dahlstrom (2016). Moving the red queen forward: Maturing analytics capabilities in higher education.

What is needed at the institutional level?
3. Data Efficacy
   a. What is the quality of the data?
   b. Are data standards processes in place?
   c. Are the proper tools available to implement?
4. Investment/Resourcing
   a. Are appropriate funding resources available and specific to analytics?
   b. Is there enough staffing, or staff with the right skills, to undertake predictive analytics in addition to regular tasks?
   c. Does the institution see predictive analytics as an investment?
E. Dahlstrom (2016). Moving the red queen forward: Maturing analytics capabilities in higher education.

What is needed at the institutional level?
5. Technical Infrastructure
   a. Does your institution have the capacity to store, manage and analyze data?
6. IT - Analytics Interaction
   a. What is the relationship between IT professionals and analytics professionals?
   b. Is there open communication between each department?
E. Dahlstrom (2016). Moving the red queen forward: Maturing analytics capabilities in higher education.

How do we move forward? Departmental Considerations
Predictive Analysis: Things to consider
● Results of predictive models are always tied to some error rate
  ○ Education data is notoriously “messy”, error prone
  ○ High error rates ~ no better than chance result
● “Statistical significance” carries heavy connotation
● Most people are more likely to believe a statistical result
  ○ Read: think of as truth
● Analysts have an ethical responsibility to be transparent

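One way to see whether a model's error rate is "no better than chance" is to compare it with a naive benchmark. The sketch below is an assumption-laden illustration, not the presenter's procedure: it uses a majority-class baseline (always guess the most common outcome) as the stand-in for "chance", and the outcome and prediction lists are invented.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Share of cases where the prediction matches the observed outcome."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def majority_baseline_accuracy(actual):
    """Accuracy achieved by always guessing the most common outcome."""
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)


# Hypothetical outcomes (1 = retained) and model predictions for ten students.
actual = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0]

model_accuracy = accuracy(actual, predicted)
baseline_accuracy = majority_baseline_accuracy(actual)
print(f"model: {model_accuracy:.0%}, baseline: {baseline_accuracy:.0%}")
```

In this invented case the model is right 70% of the time, which sounds respectable until it is set beside the 80% obtained by predicting "retained" for everyone. Reporting both numbers is one simple form of the transparency the slide calls for.
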
Side Note: Statistical Significance
● Statistical significance is (nothing more than) an INDICATION that something is happening in the data
● Statistical significance is NOT a measure of TRUTH
● Statistical significance is impacted by the nature of the data
  ○ High N size
  ○ Mismatched standard deviations
  ○ Mean differences between groups
  ○ Relationships between variables
● Not all tests of statistical significance are created equally
  ○ E.g., t-test or correlation v. regression

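The effect of N size on significance can be shown with a short calculation. The following Python sketch is illustrative only: it uses a two-sample z-test (a normal approximation, chosen here for simplicity in place of a t-test) on summary statistics, and the GPA difference, standard deviation, and group sizes are made-up values.

```python
import math
from statistics import NormalDist


def two_sample_z(mean_diff, sd, n_per_group):
    """Two-sided z-test p-value and Cohen's d for two equal-sized groups
    that share the same standard deviation (a simplifying assumption)."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    cohens_d = mean_diff / sd  # effect size does not depend on N
    return p_value, cohens_d


# Same hypothetical 0.05-point GPA difference, two very different group sizes.
small_p, small_d = two_sample_z(mean_diff=0.05, sd=0.8, n_per_group=50)
large_p, large_d = two_sample_z(mean_diff=0.05, sd=0.8, n_per_group=50_000)

print(f"n=50:     p={small_p:.3f}, d={small_d:.4f}")
print(f"n=50,000: p={large_p:.3g}, d={large_d:.4f}")
```

The difference between groups is identical in both runs, and so is the effect size. Only the p-value changes, which is why a "significant" result on a very large file says little by itself about practical importance.
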
What do we need to think about before committing to predictive analytics?
1. Is the research question well-defined?
   a. No: result of predictive analyses will be biased, unclear
      i. Risk of people using highly error prone results
2. Is the data “clean”? Are there known issues?
   a. No + known issues: can model around “known” issues sometimes
   b. No + no known issues: result of predictive analyses will be biased, unclear
      i. Risk of people using highly error prone results

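A quick profile of the data can help turn unknown issues into known ones before any modeling starts. The sketch below is a hypothetical example, not a tool from the presentation: the `profile` helper, the field names (`student_id`, `hs_gpa`, `first_term`), and the records are all invented, and a real check would cover far more than duplicates and missing values.

```python
from collections import Counter


def profile(records, id_field, required_fields):
    """Count rows, list duplicated IDs, and count missing values per field."""
    id_counts = Counter(r.get(id_field) for r in records)
    duplicate_ids = sorted(i for i, n in id_counts.items() if n > 1)
    missing = {
        field: sum(1 for r in records if r.get(field) in (None, ""))
        for field in required_fields
    }
    return {"rows": len(records), "duplicate_ids": duplicate_ids, "missing": missing}


# Hypothetical student records with one duplicate ID and two missing values.
records = [
    {"student_id": "A1", "hs_gpa": 3.2, "first_term": "2019FA"},
    {"student_id": "A2", "hs_gpa": None, "first_term": "2019FA"},
    {"student_id": "A2", "hs_gpa": 2.9, "first_term": ""},
    {"student_id": "A3", "hs_gpa": 3.8, "first_term": "2020FA"},
]

report = profile(records, "student_id", ["hs_gpa", "first_term"])
print(report)
```
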
What do we need to think about before committing to predictive analytics?
3. Are there pitfalls, ethical dilemmas, etc.?
   a. Yes: May be difficult to ensure results are used responsibly
4. Is enough known descriptively to support results?
   a. Yes or No: Statistical data is only as good as the real-world interpretation
5. Can results be easily translated from technical to useable?
   a. No: May not be appropriate for widespread use

What do we need to think about before committing to predictive analytics?
6. Are enough additional variables known/included to account for error?
   a. No: Result will be narrow and not generalizable; not valid to distribute broadly
7. Is there time to dedicate to the full life cycle?
   a. No: Result may be isolated, out of context, uninterpretable
8. Is there support for data cleansing, adjustments (from IT or end user)?
   a. No: May not be feasible to undertake

What are the responsibilities once data is delivered?
● Are data being accurately interpreted and used?
  ○ Responsibility does not end when project marked complete
  ○ Train those using predictive analyses on what is appropriate to say or claim
  ○ Include training on error of the model
● Predictive data has a greater risk for doing harm
● Develop process for checking for bias
  ○ Are any biases being masked in other fields?
    ■ E.g. Is income a proxy for ethnicity?
  ○ Check for drift over time

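One possible starting point for such a process is to break the model's error rate down by group and by time period. The Python sketch below is an assumption, not the presenter's process: the `error_rate_by` helper, the `group` and `term` fields, and the records are invented. A gap between groups can hint at bias, and a gap between terms can hint at drift, though neither is proof and this check does not by itself detect proxy variables.

```python
from collections import defaultdict


def error_rate_by(records, key):
    """Share of wrong predictions within each value of `key`."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        if r["predicted"] != r["actual"]:
            errors[r[key]] += 1
    return {value: errors[value] / totals[value] for value in totals}


# Hypothetical scored records: predicted vs. actual retention (1 = retained).
records = [
    {"group": "A", "term": "2019", "predicted": 1, "actual": 1},
    {"group": "A", "term": "2019", "predicted": 1, "actual": 1},
    {"group": "B", "term": "2019", "predicted": 0, "actual": 1},
    {"group": "B", "term": "2019", "predicted": 1, "actual": 1},
    {"group": "A", "term": "2020", "predicted": 0, "actual": 1},
    {"group": "A", "term": "2020", "predicted": 1, "actual": 1},
    {"group": "B", "term": "2020", "predicted": 0, "actual": 1},
    {"group": "B", "term": "2020", "predicted": 0, "actual": 0},
]

by_group = error_rate_by(records, "group")  # bias check
by_term = error_rate_by(records, "term")    # drift check
print("error by group:", by_group)
print("error by term:", by_term)
```
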
Final thoughts...
● Start small
  ○ Get sense of time required, difficulty of results delivery
  ○ Can we support this?
● Don’t let “perfect” be the enemy of the “good”
  ○ If we wait until we have ALL the resources we need, or ALL the understanding and buy-in, we will never start
● Question everything
  ○ All aspects of data should be scrutinized when moving to predictive analytics
● Become “bilingual”
  ○ Speaking the language of statistics is not enough
  ○ To be truly valuable, stakeholder language should be most important thing

Questions?
heatherchapman@weber.edu