Using Correlational Topic Modeling for Automated Topic Identification





























Using Correlational Topic Modeling for Automated Topic Identification in Intelligent Tutoring Systems -- Stefan Slater, Ryan Baker, Mia Almeda, Alex Bowers, Neil Heffernan
Intelligent Tutoring Systems -- Widely used educational tools -- Hundreds of thousands of students -- Personalized, data-driven approaches to (and assessment of) learning (Image: ALEKS © 2017, McGraw-Hill Education)
Domain Modeling -- Content is arranged into knowledge components (KCs) -- KCs describe the skill(s) associated with particular problems -- Domain modeling (assembling the correct KC-item mapping) is critical for success within ITS environments (Mitrovic 2010, p. 66)
Domain Modeling -- A few existing approaches to mapping KCs to individual problems: (1) Expert content creation and coding (2) Crowd-sourced tagging and metadata (3) Analytical approaches (Desmarais, 2012; Barnes, Bitzer & Vouk, 2005) -- All approaches have had some shortcomings
Domain Modeling -- We propose using natural language processing to develop domain models, specifically using correlational topic models (CTM; Blei & Lafferty, 2007). - Use the semantic content of problems as data - Easy to deploy at scale (mostly; more later) - Can make classifications for new problems based on existing domain model - Require no student data to generate
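For readers unfamiliar with CTM, the generative process described by Blei & Lafferty can be summarized as below. The notation is an assumption for illustration and does not appear in the slides: K topics, document index d, word position n, logistic-normal parameters \mu and \Sigma, and topic-word distributions \beta.

```latex
\begin{align*}
\eta_d &\sim \mathcal{N}(\mu, \Sigma) \\
\theta_{d,k} &= \frac{\exp(\eta_{d,k})}{\sum_{j=1}^{K} \exp(\eta_{d,j})} \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Multinomial}(\beta_{z_{d,n}})
\end{align*}
```

The covariance matrix \Sigma is what allows topic proportions within a document to be correlated, which a Dirichlet prior (as in LDA) cannot express.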
Domain Modeling -- We used the ASSISTments intelligent tutoring system (Heffernan & Heffernan, 2014) to perform this modeling. -- Teacher-authored problem content. -- 2012-2013 dataset: data on 112,526 problems after processing and cleaning.
Intelligent Tutoring Systems KC: Equation Solving More Than Two Steps <p> Solve for x. </p> <p> 6(11 + x) = 8(9 + x) </p> <p> </p> <p> Answer as a fraction. </p>
Intelligent Tutoring Systems KC: Proportion <p> When making tea, you use 13 spoons of sugar for every 2 quarts of tea. Which of the following equations can be used to calculate c, the number of spoons of sugar needed for 13 quarts of tea? </p> <p> </p>
Intelligent Tutoring Systems -- 51,026 problems have skill IDs -- 128,882 problems do not -- Need a technique to give us information on the 128,882 problems that we don’t know much about -- That’s CTM
Data Processing -- Needed to recode mathematical terms and expressions. - CTM considers numbers as distinct units - Cannot identify fractions, decimals, equations, etc. - Identified these items through regular expressions and recoded into semantic labels (e.g. xxfracxx)
Data Processing -- Need to strip out HTML tags. -- Able to leave ASCII and Unicode markers in, they were actually quite helpful.
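The tag-stripping step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' actual code; the function name `strip_html` and the regular expression are assumptions. HTML entities and Unicode characters are deliberately left untouched, in line with the note above that such markers were helpful.

```python
import re

# Assumed pattern: "<" followed immediately by an optional "/" and a letter,
# so that inequalities written with spaces ("x < 5") are not mistaken for tags.
# Text such as "x<y and z>w" would still be misread as a tag.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_html(raw):
    """Replace HTML tags with spaces and collapse runs of whitespace."""
    no_tags = TAG_RE.sub(" ", raw)
    return " ".join(no_tags.split())


example = "<p> Solve for x. </p> <p> 6(11 + x) = 8(9 + x) </p>"
cleaned = strip_html(example)
```

Replacing tags with a space rather than an empty string avoids gluing together words that were separated only by markup.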
Data Processing -- Before processing: <p> A deep freezer has a temperature of -20 °C when it is turned off. </p> <p> </p> <p> The temperature then rises at 0.7 °C per minute. </p> <p> </p> <p> Assume the number of minutes is your independent variable (x) and the current temperature is your dependent variable (y) </p> <p> </p> <p> Find 'y', the current temperature of the freezer after x minutes </p> <p> </p> <p> Write your equation in the form y = _______. </p>
Data Processing -- After processing: a deep freezer has a temperature of xxdegreesxx C when it is turned off the temperature then rises at xxdegreesxx c per minute assume the number of minutes is your independent variable x and the current temperature is your dependent variable y find y, the current temperature of the freezer after x minutes write your equation in the form y
Data Processing
Problem Feature                  Regex
Single Digits, {0-9}             \d{1}
Multiple Digits, {10+}           \d{2,}
Decimals (e.g. 3.14)             (not listed)
Fractions (e.g. 3/7)             \d{1,}/\d{1,}
Dollar Amounts (e.g. $2.75)      $\d{1,}(.\d{1,})?
Percents (e.g. 100%)             %\d{1,}(.\d{1,})?
Degrees (e.g. 90°)               \d{1,}°
‘Explicit’ Numbers (e.g. #4)     #\d{1,}
-- Regex applied in decreasing order of complexity to avoid conflicts.
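A sketch of how ordered recoding might look is given below. It is an illustration under stated assumptions, not the authors' implementation: the patterns are written fresh for this example, the labels `xxpercentxx` and `xxsinglenumxx` are hypothetical, and the mapping of the remaining labels (which do appear in the slides) to specific features is inferred from their names.

```python
import re

# Rules run from most to least specific so that, for example, "3.14" is
# recoded as one decimal rather than as three separate digits.
RECODE_RULES = [
    (re.compile(r"\$\d+(?:\.\d+)?"), "xxmoneyamtxx"),
    (re.compile(r"\d+(?:\.\d+)?\s*%"), "xxpercentxx"),    # hypothetical label
    (re.compile(r"-?\d+(?:\.\d+)?\s*°"), "xxdegreesxx"),
    (re.compile(r"#\d+"), "xxexplnumxx"),
    (re.compile(r"\d+\s*/\s*\d+"), "xxfracxx"),
    (re.compile(r"\d+\.\d+"), "xxdecimalxx"),
    (re.compile(r"\d{2,}"), "xxmanynumxx"),
    (re.compile(r"\d"), "xxsinglenumxx"),                 # hypothetical label
]


def recode(text):
    """Apply each rule in order, then collapse whitespace."""
    for pattern, label in RECODE_RULES:
        text = pattern.sub(f" {label} ", text)
    return " ".join(text.split())


recoded = recode("You have $2.75 and 3/7 of 100%")
```

Because the labels contain no digits, a span that has already been recoded cannot be matched again by a later, simpler rule.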
Modeling -- Used an iterative modeling approach, gradually increasing the number of topics and assessing perplexity scores as a measure of goodness. -- Evaluated models at k = 5, k = 15, and k = 25.
Model     Perplexity Score
k = 5     319.91
k = 15    227.40
k = 25    189.28
-- ASSISTments contains at least k = 190, and possibly as high as k = 450 or so
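Perplexity is conventionally the exponential of the negative mean per-token log-likelihood, so lower values indicate a better fit. The sketch below shows that definition and picks the best of the three reported models. The slides do not say how perplexity was computed, so the function is the standard definition rather than the authors' procedure, and the names `perplexity`, `reported`, and `best_k` are illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp(-mean log-likelihood per token), using natural logarithms."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Perplexity scores reported in the table above, keyed by number of topics k.
reported = {5: 319.91, 15: 227.40, 25: 189.28}

# Lower perplexity is better, so choose the k with the smallest score.
best_k = min(reported, key=reported.get)

# Sanity check of the definition: a model that assigns probability 1/4 to
# every token has a perplexity of about 4.
uniform_example = perplexity([math.log(0.25)] * 4)
```

Since perplexity was still falling at k = 25, the table suggests that the best-fitting number of topics lies above the range evaluated.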
Results -- CTM identified three broad ‘types’ of topics in the k = 25 model. (1) True KCs (2) Reminders and scaffolding (3) System guidance
Topic Type               Topic Count
True KC                  16
Reminders/Scaffolding    4*
System Guidance          5
Results “True KC” Topics Topic 12: “A vs. B Comparison Problems” -- “best”, “choose”, “follow”, “part”, “two” Topic 13: “Currency Problems” -- “xxmoneyamtxx”, “number”, “cost”, “answer”, “total”
Topic 12: A vs. B Comparison “The city of Lakewood is building a new shopping center. They have three plans to choose from and they want to base their decision on the choice of the people living in the immediate area. Therefore, they hire a research group to sample the population of Lakewood to find out which shopping center plan they like the best. The research group uses the sampling method described below: The research group creates assigns each house in Lakewood with an ID number and then uses a computer to randomly select 100 different houses to answer questions about the new shopping center. What sampling method was used?”
Topic 13: Currency “Write an equation and solve: You have $60 and your sister has $120. You are saving $7 per week and your sister is saving $5 per week. How many weeks will it be before you and your sister have the same amount of money?”
Results “Reminder and Scaffolding” Topics Topic 17: “How To Enter Fractions” -- “answer”, “make”, “type”, “fraction”, “enter” Topic 11: “Rounding Answers” -- “nearest”, “round”, “place”, “answer”, “hundredth”
Topic 17: Entering Fractions “Using the properties of equality, find the value of x in the equation below. 11 - 5x = 7 Type your answer as a fraction so that you give the exact answer not an estimate.” “Find the slope of the line that passes through the following points. Write "undefined" if there is no slope. (-7, -11) (0, 10) Write your answer as a fraction if needed.”
Topic 11: Rounding Answers “Calculate the mean of the following numbers: 5, 17, 16, 10, 16 (round to the nearest tenths place).” “A man whose eyes are 5 feet above ground is standing on the runway of an airport 100 feet from the control tower. The person observes an air traffic controller at the window of a 132 foot tower. What is the staring distance between the man's eyes and the air traffic controller? Round to the nearest hundredths.”
Results “System Guidance” Topics Topic 2: “You have __ attempts left” -- “left”, “attempt”, “xxexplnumxx”, “xxmanynumxx”, “xxdecimalxx” Topic 25: “Future Instructions” -- “follow”, “correct”, “select”, “subtract”, “label”
Topic 2: X Attempts Left “Sorry, that is incorrect. You have no attempts left. The answer is 0.97.” “9x + 3x + 10 = 3(3x + x) (You have 2 attempt(s) left.)” Topic 25: Future Instructions “The following four questions are true or false. If the answer is true, select A as the correct response. If false, select the response that makes the statement true.”
Conclusion -- CTM identified textual features that appeared together within ASSISTments problems -- These features were not just skills, but also common phrasings and scaffolds in the system -- Appears to have high discrimination between skills (e.g. Topic 7, “Mixed & Improper Fractions” vs. Topic 19, “Whole Fractions”)
Limitations -- Computationally intensive, which makes it difficult to model high numbers of topics. -- Identified 25 topics. ASSISTments contains anywhere from 100 to 400 KCs. -- Extensive re-coding required to deal with mathematical expressions and equations (but this could be automated to some degree; please help)
Future Directions -- Evaluating domain model goodness (e. g. through knowledge tracing) -- Filtering scaffold and reminder ‘topics’ to better distill domain model - or use these topics towards feature engineering! -- Increase computational power to ensure that we obtain the best-fitting number of latent topics -- Consider the potential for skill/topic hierarchies (Vytasek, Wise & Woloshen, two days ago)
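One way the filtering idea might be sketched: drop the non-KC topics from a problem's topic proportions and renormalize what remains. This is a hypothetical illustration, not something the slides describe in detail. The set below contains only the four non-KC topics named as examples earlier (Topics 2, 11, 17, and 25); the full list of nine is not given, and `kc_only` is an invented name.

```python
# Non-KC topics named as examples in the Results slides; the complete list
# of reminder/scaffolding and system-guidance topics is not available here.
NON_KC_TOPICS = {2, 11, 17, 25}


def kc_only(topic_props):
    """Drop non-KC topics and rescale the remaining proportions to sum to 1."""
    kept = {t: p for t, p in topic_props.items() if t not in NON_KC_TOPICS}
    total = sum(kept.values())
    if total == 0:
        return {}  # nothing but scaffolding or system-guidance content
    return {t: p / total for t, p in kept.items()}


# A made-up problem split across Topic 13 (currency), Topic 12 (comparison),
# and Topic 17 (how to enter fractions).
filtered = kc_only({13: 0.3, 17: 0.4, 12: 0.3})
```

The dropped proportions need not be discarded: they could instead be kept as separate features, as the slide suggests.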
Future Directions (for your colleagues) Our lab is looking for a postdoc! This Could Be You! EMAIL: ryanshaunbaker@gmail.com
Questions And Contact slater. research@gmail. com stefanslater. com @datarefinery github: itsonlydata (soon!)