Lecture 9: Pitching Analysis
https://baseballsavant.mlb.com/visuals
Wisdom of Crowds
Event: class prediction (outcome)
• Tampa – Kansas (Super Bowl): 0.38
• Liverpool – Man City: 0.48
• Warriors – Nets: 0.53
• Serena wins Australian Open (yes/no): 0.36
• Who wins men's Australian Open (Djok, Med, Tsi, Kel): 0.85
• Atletico Madrid vs Chelsea (Atletico/Chelsea/Tie): 0.35
• Bruins @ Rangers
Course Outline
Lecture # and Lecture Title
1 ✔ Introduction to Sports Analytics
Module 1: Measuring Team Strength and Predicting Outcomes
2 ✔ Normal models of score differentials in the NFL (I)
3 ✔ Normal models of score differentials in the NFL (II)
4 ✔ Logistic models and the Elo system
5 ✔ Poisson models for low-scoring sports
6 ✔ Simulating matches and tournaments
Module 2: Situational analysis
7 Valuing states (I. Markov Chains)
8 Valuing states (II. Applications to baseball)
9 Working with PITCHf/x data: Clustering pitch types
10 When should you go for it on fourth down? (I)
11 When should you go for it on fourth down? (II)
12 Guest Lecture
13 xG: Measuring chance quality in soccer (I)
14 xG: Measuring chance quality in soccer (II)
Module 3: Player evaluation
15 The +/- score (I. Basic and adjusted +/-)
16 The +/- score (II. Regularized adjusted +/-)
17 Streaks, momentum and the hot hand (I)
18 Streaks, momentum and the hot hand (II)
Module 4: Tracking data
19 Introduction to working with tracking data
20 Guest Lecture: William Spearman (Liverpool FC)
Final Project
21-23 <PROJECT WORK IN CLASS>
24 Project presentations
Sections
• Christy: Monday, 9 am-10 am & 9 pm-10 pm
• Andrew: Tuesday & Thursday, 7 pm-8 pm
Office hours
• Christy: Tuesdays, 10 am to 12 pm
• Andrew: Tuesday & Thursday, 8 pm to 9 pm
• Laurie: Thursdays, 1 pm to 3 pm
PS 2 issued today (and due on 3/5)
1st Short Project due on 3/10
Lecture Outline
• Intro to PITCHf/x data
• Visualizing pitches
  • Scatter plots
  • Heatmaps
  • Contour plots
• Identifying pitch types from trajectories
• Intro to K-means clustering
• Pitch analysis
Pitching data
See: https://baseballsavant.mlb.com/visuals/
• Camera-based systems that can measure the trajectory, speed, spin, and break of a pitched ball
• Examples include PITCHf/x, TrackMan, and Hawk-Eye
• Typical accuracies of about 1 inch
• Measurement of spin & trajectory enables classification of pitch type (curveball, knuckleball, two/four-seam fastball, etc.)
The strike zone
Definition: the strike zone is the volume of space through which a pitch must pass in order to be called a strike even if the batter does not swing.
• The official strike zone is the area over home plate between an upper limit (the midpoint between the batter's shoulders and the top of the uniform pants) and a lower limit (a point just below the kneecap).
• Strikes and balls are called by the home-plate umpire after every pitch has passed the batter, unless the batter makes contact with the baseball.
Strike zone dimensions
• Height of the strike zone varies with the height of the batter
• For a 73’’ tall player (6’ 1’’, 1.85 m), height of strike zone: 22.85’’ (58 cm)
• Plate width: 17’’
• However, for a pitch to be called a strike, only part of the ball must cross over part of home plate while in this area.
Strike zone dimensions
• Height of the strike zone varies with the height of the batter
• For a 73’’ tall player (6’ 1’’, 1.85 m):
  • height of strike zone: 22.85’’ (58 cm)
  • height of effective strike zone: 25.79’’
• Effective width: 19.94’’
• Add a border around the technical strike zone equal to half the diameter of the ball = 0.5 x 2.94 = 1.47’’
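The arithmetic behind the effective strike zone can be checked with a few lines of Python. This is only a sketch: the function name `effective_zone` is an assumption made for illustration, while the numbers (17’’ plate, 2.94’’ ball, 22.85’’ zone height) are the ones quoted on the slide.

```python
BALL_DIAMETER_IN = 2.94   # ball diameter quoted on the slide, in inches
PLATE_WIDTH_IN = 17.0     # plate width quoted on the slide, in inches


def effective_zone(zone_height_in,
                   plate_width_in=PLATE_WIDTH_IN,
                   ball_diameter_in=BALL_DIAMETER_IN):
    """Return (width, height) of the effective strike zone in inches.

    A border of half a ball diameter is added on every side of the
    technical zone, so each dimension grows by one full ball diameter.
    """
    border = 0.5 * ball_diameter_in
    return plate_width_in + 2 * border, zone_height_in + 2 * border


# Zone height for the 73-inch batter used in the slides.
width, height = effective_zone(22.85)
print(f"effective width  = {width:.2f} in")
print(f"effective height = {height:.2f} in")
```

Both dimensions grow by exactly one ball diameter, which reproduces the 19.94’’ and 25.79’’ figures above.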
Where do pitchers pitch?
• Can use pitch-by-pitch data to visualize all the pitches of any pitcher in the 2015-2018 seasons
• Take the example of Justin Verlander
  • Strikes, balls, in-play
  • Left- vs right-handed batters
  • Pitch types
• Visualisation types:
  • Scatter
  • Heatmap
  • Contours
Pitching data
Columns in the pitch-by-pitch table:
• Positions of the ball when crossing the plate
• Pitch speed at the ‘initial point’
• ‘Spin distances’
• Code: result of pitch (B: ball, S: swinging strike, etc.)
• Type: simple pitch outcomes (S: strike, X: in play, B: ball)
• Pitch_type: type of pitch (FF: four-seam fastball, SL: slider, etc.)
• Score (batting team)
• Unique id for each at-bat
• # balls, # strikes
• Pitch #
• Runners on 1st, 2nd, 3rd
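Pitching data
• px, pz: position of the ball when crossing the front of home plate (in feet)
• px = 0 is defined at the plate center
• pz measures height from the ground (pz = 0 at ground level)
[Figure: px (horizontal) and pz (vertical) axes, with the origin (0, 0) on the ground at the plate center]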
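Because px and pz are in feet while the zone dimensions are quoted in inches, a unit conversion is needed before comparing them. The sketch below tests whether a pitch location touches the effective strike zone. The function name and the `sz_bot` / `sz_top` arguments (bottom and top of the batter's technical zone, in feet) are hypothetical inputs chosen for illustration and are not defined in the slides.

```python
BALL_DIAMETER_IN = 2.94   # ball diameter, inches (from the slides)
PLATE_WIDTH_IN = 17.0     # plate width, inches (from the slides)
INCHES_PER_FOOT = 12.0


def in_effective_zone(px, pz, sz_bot, sz_top):
    """Return True if a pitch at (px, pz), in feet, touches the effective zone.

    sz_bot and sz_top are the bottom and top of the technical strike zone
    in feet (hypothetical inputs for this sketch).
    """
    border = 0.5 * BALL_DIAMETER_IN / INCHES_PER_FOOT
    half_width = 0.5 * PLATE_WIDTH_IN / INCHES_PER_FOOT + border
    inside_x = abs(px) <= half_width
    inside_z = (sz_bot - border) <= pz <= (sz_top + border)
    return inside_x and inside_z


# Made-up zone: bottom at 1.5 ft, height 22.85 in as in the earlier example.
sz_bot = 1.5
sz_top = sz_bot + 22.85 / INCHES_PER_FOOT
for px, pz in [(0.0, 2.5), (0.8, 2.5), (0.9, 2.5), (0.0, 1.3)]:
    print(px, pz, in_effective_zone(px, pz, sz_bot, sz_top))
```

The half-width of the effective zone is about 0.83 ft, so a pitch at px = 0.8 still clips the zone while one at px = 0.9 does not.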
Plotting pitches
• Load data from the 2015, 2016 and 2017 seasons
• Plot pitch locations for Justin Verlander
• Explore plotting methods
TO CODE: BB_pitcher_locations.py, BB_viz.py
Intensity plot (‘heatmaps’)
The solution to crowded scatter plots is to plot them as an intensity chart (or ‘heatmap’). This is an example of 3-d plotting.
Steps:
1. Divide the area of the plot into pixels (or ‘bins’)
2. Create a 2-d histogram by counting the number of pitches that fall in each bin
3. Color the bins according to the total count
Optional: add contours to mark regions of the image that are above (or below) a given threshold
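The binning steps can be sketched without any plotting library. The example below uses randomly generated locations in place of real pitch data, and the helper `histogram2d` is a hypothetical name written for this illustration. In practice a library routine such as NumPy's `histogram2d` would normally do the counting.

```python
import random


def histogram2d(xs, zs, x_range, z_range, n_bins):
    """Count points per bin; counts[row][col], rows index z and columns index x."""
    (x_lo, x_hi), (z_lo, z_hi) = x_range, z_range
    counts = [[0] * n_bins for _ in range(n_bins)]
    for x, z in zip(xs, zs):
        if not (x_lo <= x < x_hi and z_lo <= z < z_hi):
            continue  # ignore points outside the plotted area
        col = min(int((x - x_lo) / (x_hi - x_lo) * n_bins), n_bins - 1)
        row = min(int((z - z_lo) / (z_hi - z_lo) * n_bins), n_bins - 1)
        counts[row][col] += 1
    return counts


# Synthetic stand-in for pitch locations (px, pz), in feet.
rng = random.Random(0)
xs = [rng.gauss(0.0, 0.7) for _ in range(2000)]
zs = [rng.gauss(2.5, 0.8) for _ in range(2000)]

counts = histogram2d(xs, zs, (-2.0, 2.0), (0.0, 5.0), n_bins=10)
peak = max(max(row) for row in counts)

# Optional contour step: flag bins holding at least half the peak count.
above = [[c >= 0.5 * peak for c in row] for row in counts]
for row_mask in reversed(above):  # print the highest z row first
    print("".join("#" if flagged else "." for flagged in row_mask))
```

Coloring the bins (step 3) is then a matter of passing `counts` to an image-plotting routine, and the `above` mask marks the region a contour line would enclose.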
TO CODE Heatmaps & contours
Detecting pitch types from trajectories
Brief introduction to clustering
The purpose of clustering is to divide unlabeled data into groups (or clusters), such that the individual points in each group more closely resemble each other than the points in the rest of the dataset (i.e., find things of similar type).
• A form of unsupervised learning
• Very useful for finding structure in data
• Very widely applied across many fields:
  • Group emails together according to content (topics)
  • Find people with similar shopping habits
  • Segment population into distinct groups based on population density
Cholera
John Snow (not that one), a physician in the 1850s, plotted the locations of cholera deaths in London. The locations indicated that cases were clustered around intersections where there were polluted wells.
Types of clustering algorithms
Broadly four classes of clustering algorithms:
1. Connectivity
• Look at the relationship between pairs of points or clusters
• Examples include hierarchical clustering (‘bottom up’)
2. Centroid
• Specify the location of cluster centers and identify points near to those centers
• Examples include k-means clustering
3. Distribution
• Use PDFs to define clusters and calculate the probability of each point belonging to each cluster
• E.g. expectation-maximization / Gaussian mixture models
4. Density
• Calculate the density of points at any given position in parameter space and search for peaks
• E.g. DBSCAN
[Figure: hierarchical clustering]
K-means clustering
o K-means algorithm (MacQueen ’67): each cluster is represented by the centre of the cluster, and the algorithm converges to stable cluster centroids.
o The K-means algorithm is the simplest partitioning method for cluster analysis and is widely used in data mining applications.
K-means Algorithm
Given the cluster number K, the K-means algorithm is carried out in three steps after initialisation:
Initialisation: set seed points (randomly)
1) Assign each object to the cluster of the nearest seed point, measured with a specific distance metric
2) Compute new seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., mean point, of the cluster)
3) Go back to Step 1); stop when there are no more new assignments (i.e., membership in each cluster no longer changes)
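The three steps above can be sketched in plain Python. This is a minimal illustration and not the course's own script: the helper names are hypothetical, Euclidean distance is assumed as the metric, and the seeds are taken to be K randomly chosen data points (the slide only says "randomly").

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Step 1: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def update(points, labels, centroids):
    """Step 2: move each centroid to the mean of its current members."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its centroid
            continue
        new_centroids.append(
            tuple(sum(coord) / len(members) for coord in zip(*members)))
    return new_centroids


def kmeans(points, k, rng, max_iter=100):
    """Run k-means until cluster membership stops changing (step 3)."""
    centroids = rng.sample(points, k)  # initialisation: k random data points
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # no reassignment, so we have converged
            break
        labels = new_labels
    return centroids, labels


# Two small, well-separated groups of made-up points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2, rng=random.Random(1))
print(labels)
print(centroids)
```

Passing the random generator in explicitly makes a run repeatable, which helps when comparing different initialisations.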
Example: un-clustered data
Generate some data by drawing from 5 bivariate normal distributions (different means, same covariance). The goal is to find five clusters and identify which cluster each point belongs to.
Step 0 (a) Randomly select 5 points as the cluster centroids
Step 0 (b) Allocate each data point to a group based on the closest centroid
Iteration 1 Calculate the mean positions of points in each group and move the centroids
Iteration 2 Calculate the mean positions of points in each new group and move the centroids
Iteration 3 Calculate the mean positions of points in each new group and move the centroids
Iteration 4 Calculate the mean positions of points in each new group and move the centroids
Iteration 5 Calculate the mean positions of points in each new group and move the centroids
Iteration 6 Calculate the mean positions of points in each new group and move the centroids
Iteration 7 Calculate the mean positions of points in each new group and move the centroids
Iteration 8 Calculate the mean positions of points in each new group and move the centroids
Iteration 9 Calculate the mean positions of points in each new group and move the centroids
Iteration 10 Calculate the mean positions of points in each new group and move the centroids
Iteration 11 Calculate the mean positions of points in each new group and move the centroids
Iteration 12 Calculate the mean positions of points in each new group and move the centroids
Iteration 13 Calculate the mean positions of points in each new group and move the centroids. The centroid positions have barely moved: CONVERGENCE!
Unlabeled data Start
Final clusters End (13 iterations) Very simple, very effective.
Stopping criteria •
Score •
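The formula on this slide did not survive extraction. A common choice of score for k-means, and one consistent with the later statement that the score always decreases as clusters are added, is the within-cluster sum of squared distances. Treat the following as an assumption rather than the slide's own definition. Here $C_k$ is the set of points assigned to cluster $k$ and $\mu_k$ is its centroid:

```latex
S = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2},
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```

Under this definition a lower score means tighter clusters.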
Failures
The algorithm can get stuck in a local minimum, so it is better to run it several times and keep the result that gives the best (lowest) score.
Example scores from two runs: S = 5800 and S = 3479
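Running the algorithm several times and keeping the best run can be sketched as follows. This sketch assumes the score is the within-cluster sum of squared distances, so lower is better. The function names and the synthetic data are hypothetical and exist only for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from random seeds; returns (score, centroids, labels)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # membership unchanged: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    # Assumed score: sum of squared distances to the assigned centroid.
    score = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return score, centroids, labels


def kmeans_best_of(points, k, n_runs, seed=0):
    """Repeat k-means n_runs times; return the lowest-score run and all scores."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_runs)]
    best = min(runs, key=lambda run: run[0])
    return best, [run[0] for run in runs]


# Synthetic data: three made-up blobs of 50 points each.
data_rng = random.Random(42)
centers = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in centers for _ in range(50)]

(best_score, best_centroids, best_labels), all_scores = kmeans_best_of(
    points, k=3, n_runs=10)
print("scores per run:", [round(s, 1) for s in all_scores])
print("best score:", round(best_score, 1))
```

If the printed scores differ between runs, the higher ones correspond to runs that settled in a poorer local minimum.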
Picking the number of clusters
• The number of clusters must be specified a priori
• How to do this?
  • Inspect data: can we guess at the number of clusters by looking at (e.g. plotting) the data?
  • ‘Elbow’ plot: the score will always decrease as we add more clusters. Use the elbow plot to find the point at which the % reduction in score becomes small as we add an extra cluster
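The "% reduction" rule can be made concrete with a short sketch. The scores below are made-up numbers used only to show the arithmetic, and both the 10% threshold and the function names are assumptions. In practice the scores would come from running k-means for each candidate number of clusters.

```python
def percent_reductions(scores):
    """scores[i] is the score with i + 1 clusters.

    Returns the % drop in score obtained by adding each extra cluster.
    """
    return [100.0 * (prev - cur) / prev
            for prev, cur in zip(scores, scores[1:])]


def pick_elbow(scores, min_gain_pct=10.0):
    """Smallest k for which going from k to k + 1 clusters gains < min_gain_pct."""
    for k, gain in enumerate(percent_reductions(scores), start=1):
        if gain < min_gain_pct:
            return k
    return len(scores)  # no elbow found within the range that was tried


# Hypothetical scores for k = 1..8 (illustrative values, not real results).
scores = [1000.0, 520.0, 260.0, 120.0, 60.0, 55.0, 51.0, 48.0]
for k, gain in enumerate(percent_reductions(scores), start=1):
    print(f"{k} -> {k + 1} clusters: {gain:.1f}% reduction")
print("elbow at k =", pick_elbow(scores))
```

The threshold is a judgment call. Plotting the scores against k and looking for the bend is the usual sanity check on whatever value the rule returns.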
Other considerations •
Clustering pitches by trajectory
Pitch types
The high speed and spin that pitchers can put on the ball allow for a very diverse range of trajectories that can be used to confuse or trick the batter. Virtually every Major League pitcher throws a combination of pitches, with starting pitchers often owning an arsenal of three or more offerings.
• Changeup
• Curveball
• Cutter
• Eephus
• Forkball
• Four-seam fastball
• Knuckle-curve
• Screwball
• Sinker
• Slider
• Splitter
• Two-seam fastball
• Four-seam fastball: fastest, straightest pitch. Little to no movement.
• Two-seam fastball: moves downward and, depending on the release, will sometimes run in on a right-handed hitter.
• Changeup: slower than a fastball, but thrown with the same arm motion.
• Curveball: a breaking pitch that has more movement than just about any other pitch.
• Cutter: breaks away from a right-handed hitter (RHH) as it reaches the plate.
Identifying pitch types
Pitches can be roughly identified using three properties of their trajectories:
• Speed (normally measured on release)
• Break in the horizontal direction (side spin)
• Break in the vertical direction (top/back spin)
The PITCHf/x data measures break slightly differently: break length and break angle
Break in PITCHf/x
Break length is the maximum distance (in 3-d) between the trajectory the ball actually took and the straight-line path from release to where the ball crosses the plate. It measures total break, in both the horizontal and vertical directions.
[Figure: x, y, z axes showing the actual ball trajectory, the straight-line path from release to plate, the break length, and the x and z components of break]
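The definition of break length translates directly into a point-to-line distance calculation. The sketch below applies it to a made-up constant-acceleration trajectory. The function names, the release point, and the velocity and acceleration values are all hypothetical, and the result is in whatever length unit the trajectory uses (feet here).

```python
import math


def point_to_line_distance(p, a, b):
    """Perpendicular distance from 3-d point p to the line through a and b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    cross = (ab[1] * ap[2] - ab[2] * ap[1],
             ab[2] * ap[0] - ab[0] * ap[2],
             ab[0] * ap[1] - ab[1] * ap[0])
    return math.sqrt(sum(c * c for c in cross)) / math.sqrt(sum(c * c for c in ab))


def break_length(trajectory):
    """Largest deviation of the sampled path from the release-to-plate line."""
    release, plate = trajectory[0], trajectory[-1]
    return max(point_to_line_distance(p, release, plate) for p in trajectory)


def sample_trajectory(n_steps=50, flight_time=0.42):
    """Made-up pitch: constant acceleration, units of feet and seconds."""
    x0, y0, z0 = -2.0, 55.0, 6.0        # hypothetical release point
    vx, vy, vz = 6.0, -130.0, -4.0      # hypothetical initial velocity
    ax, ay, az = -12.0, 25.0, -20.0     # hypothetical acceleration
    points = []
    for i in range(n_steps + 1):
        t = flight_time * i / n_steps
        points.append((x0 + vx * t + 0.5 * ax * t * t,
                       y0 + vy * t + 0.5 * ay * t * t,
                       z0 + vz * t + 0.5 * az * t * t))
    return points


print(f"break length = {break_length(sample_trajectory()):.2f} ft")
```

Because the trajectory is sampled at discrete times, the result slightly underestimates the true maximum unless the sampling is fine enough.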
Break angle
Break angle measures the horizontal (side) component of the break. If break_angle = 0, the ball showed no deviation from the straight line between release and the plate in the x-y plane.
[Figure: top-down view from pitcher to plate showing the actual ball trajectory, the straight-line path from release to plate, the x component of break, and the break angle]
Clustering pitch types
Data scientists at Major League Baseball have trained neural networks to identify pitch types from their trajectories.
• A separate network is trained for each established pitcher
• Rookies are classified using a general network for the league
The automated classifications for each pitch are displayed on Gameday, At-Bat, and MLB.TV, on local and national broadcasts, and on scoreboards and in-stadium displays.
[Figure: pre-2017 model vs new model]
Only 1 in 40 pitches is incorrectly classified by the new model
K-means on pitch data
Challenge: use k-means clustering to identify the different pitch types in a pitcher’s arsenal.
Cluster pitches based on:
• Release speed
• Break length
• Break angle
Compare the occupants of each detected cluster with MLB’s ‘PitchNet’ pitch type classifications.
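The three features are measured in different units and have very different numeric spreads, so a feature with a large spread would dominate the Euclidean distance used by k-means. One common remedy is to z-score each feature before clustering. The sketch below shows that step on a handful of made-up pitches. The rows and the function name are hypothetical and are not taken from the course data.

```python
import statistics


def zscore_columns(rows):
    """Rescale each column to zero mean and unit (population) standard deviation.

    Assumes no column is constant; a constant column would divide by zero.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [tuple((value - mean) / std
                  for value, mean, std in zip(row, means, stds))
            for row in rows]


# Made-up pitches: (release speed, break length, break angle).
pitches = [
    (95.1, 3.8, 28.0),
    (86.4, 8.2, -10.5),
    (79.9, 12.6, -14.0),
    (94.3, 4.1, 30.2),
]

scaled = zscore_columns(pitches)
for row in scaled:
    print(tuple(round(value, 2) for value in row))
```

After scaling, every feature contributes on a comparable footing, and the scaled rows can be passed to any k-means implementation.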
-> TO CODE (BB_cluster_pitches.py)
Assignments
# CHALLENGE 1
• If you didn't know how many clusters to pick, how many might you choose? (see elbow plot)
• How well do we do if we don't z-score first?
# CHALLENGE 2
• Repeat this for another pitcher (you'll have to use the player_names file to choose your pitcher and find their id)
• You'll also need to decide how many clusters to pick