How To: Creating Pitcher Splits with Play-by-Play Data, Part I
Part one of a series to demonstrate how to create pitching splits with play-by-play data
After a few months off, our “How to” series is back! Today, Andrew Bower will walk through how to take minor league play-by-play data and create several types of pitching splits ⬇️
If coding and working with data are of no interest to you, skip down to the sections covering last night’s minor league games.
Important note: The tutorial posts assume that the reader has at least a basic understanding of R. If you've never written a line of code or worked with baseball data, there are many excellent introductory resources. [R for Data Science](https://r4ds.hadley.nz/) by Hadley Wickham & Garrett Grolemund is a great place to start. It covers everything you'll need to know to begin working with data, performing analysis, and building visualizations. This tutorial also assumes you have worked through Part I and Part II of our earlier play-by-play tutorials and have a season's worth of data ready; if not, check them out first.
Picking up where we left off...
In February, we walked through how to pull a list of games and write a function that takes that game list as an argument. We used that function to pull play-by-play data for one day's worth of games last season (2022) in Part I, and in Part II, we took it one step further and built a function to grab play-by-play data for an entire level (e.g., AAA, Low-A) of games for an entire season. This tutorial will show you how to analyze those pitcher data further.
# loading the packages we will use
if (!requireNamespace('pacman', quietly = TRUE)) {
  install.packages('pacman')
}
pacman::p_load(tidyverse, here, janitor, baseballr, Lahman, rvest, retrosheet, lubridate, gt)
This tutorial works through the entire AA season from 2022. If you've worked through Part I and Part II, you should have a data frame for the whole season. If not, expect to spend about an hour running those earlier examples to scrape the data before you continue.
# for the examples in this tutorial, we will call the play-by-play data from the entire AA 2022 season 'df'
glimpse(df)
# 688344 rows x 149 columns
The play-by-play (hereafter, PBP) data take some time to get used to. Today's tutorial will familiarize you with counting balls and strikes for a single outing, the last five outings, and the full season for an individual pitcher, and with comparing teams. It also shows you how to build some performance metrics to compare pitcher-batter matchups and how those vary by count. First, let's explore the simplest case: ball and strike rates within a single game.
# we will look at the Portland Sea Dogs (Boston Red Sox) vs New Hampshire Fisher Cats (Toronto Blue Jays)
dd <- df %>%
  filter(game_pk == 672949)
glimpse(dd)
# 376 x 149
We now have a data frame with 376 observations from the first game of the 2022 season between Portland and New Hampshire. This does not necessarily mean 376 pitches were thrown in the game. The tutorial uses this game as an example because it has plenty of action to illustrate the complexity of these analyses; Portland ultimately won 11-6. Looking at how these data are displayed, you might notice the records are in reverse chronological order (the game's last play is the first row you will see). This is just how they are arranged; you can rearrange them to suit your needs. Without changing the arrangement (for now), look at the index and pitchNumber variables to understand how to separate out a single at-bat by a pitcher.
# using the table function we want to see the relationship between 'index' and 'pitchNumber'
with(dd, table(index, pitchNumber))
When you run the above code for this game (your game might be different), you will see the following or something like it.
     pitchNumber
index  1  2  3  4  5  6  7  8  9
    0 74  0  0  0  0  0  0  0  0
    1 11 62  0  0  0  0  0  0  0
    2  2 13 53  0  0  0  0  0  0
    3  1  1 13 46  0  0  0  0  0
    4  0  0  1  9 29  0  0  0  0
    5  0  0  0  0  8 12  0  0  0
    6  0  0  0  0  0  3  7  0  0
    7  0  0  0  0  0  0  0  4  0
    8  0  0  0  0  0  0  0  0  2
Typically, the row where index == 0 and pitchNumber == 1 corresponds to the first pitch of an at-bat. However, index counts every event recorded while a batter is up, not only pitches. (Note: Tanner Morris led the game off with a home run (excellent way to start a season!), but his index for that first pitch is "3" because the pre-game, warm-up, and in-progress observations take "0", "1", and "2" and list him as the batter.) To make further sense of these data, also take note of the isPitch and type variables. isPitch is either TRUE or FALSE, and type is "pitch", "pickoff", or "action". Where isPitch == FALSE and/or type == "action", you will sometimes see the index repeat the previous row's value. This indicates that the pitch associated with that index had an additional action (e.g., someone stole a base on that pitch).
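To make the index and pitchNumber relationship concrete, here is a minimal sketch on a hand-built toy plate appearance. These are not rows from the real game: the column names come from the data above, but every value is invented for illustration, and `offset` is a hypothetical helper column. The sketch puts the rows in chronological order, keeps only pitches, and measures how far index has drifted from pitchNumber.

```r
library(dplyr)

# Toy plate appearance, stored newest-first like the PBP data.
# All values are invented for illustration.
toy <- tibble::tibble(
  index = c(3L, 2L, 1L, 0L),
  pitchNumber = c(3L, 2L, 1L, NA),
  isPitch = c(TRUE, TRUE, TRUE, FALSE),
  type = c("pitch", "pitch", "pitch", "action"),
  details.description = c(
    "In play, out(s)", "Ball", "Called Strike", "Pitching Substitution"
  )
)

# Put the events in chronological order within the plate appearance.
toy_ordered <- toy %>%
  arrange(index)

# Keep pitches only. `offset` (hypothetical helper) is the gap between
# `index` and the value it would have if every event were a pitch.
toy_pitches <- toy_ordered %>%
  filter(isPitch) %>%
  mutate(offset = index - (pitchNumber - 1L))

toy_pitches
```

In this toy case, the one non-pitch event at index 0 pushes every pitch one step off the diagonal, which is the same pattern as the off-diagonal counts in the table above.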
Calculating results from a pitcher’s performance using PBP data
Caveat: You could use many different methods to do the following. There are many variables to explore within the data we have. This is the way I did it.
# isPitch vs type
with(dd, table(isPitch, type))
# 351 pitches makes sense in an 11-6 9 inning game.
# for illustrative purposes when `index` and `pitchNumber` don't follow the general rule above
dd %>%
  select(
    isPitch,
    type,
    details.description,
    index,
    pitchNumber,
    matchup.batter.fullName
  )
# let's check to ensure our plan for exploring pitch only data works
dd %>%
  filter(isPitch == TRUE & type == "pitch") %>% # probably redundant
  select(
    details.description,
    result.event,
    index,
    pitchNumber,
    matchup.batter.fullName,
    details.isBall,
    details.isStrike
  )
Our first step is to count the balls, strikes, and pitches each pitcher throws. To do this, we will utilize the summarize() function.
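As a rough sketch of what that counting step could look like (not necessarily the approach used in the full post), the example below groups toy pitch-level rows by pitcher and counts with summarize(). The rows are invented for illustration, and `matchup.pitcher.fullName` is an assumed column name by analogy with `matchup.batter.fullName`; check names(df) for the actual pitcher column in your data.

```r
library(dplyr)

# Toy pitch-level rows; every value here is invented for illustration.
# `matchup.pitcher.fullName` is an assumed column name, by analogy with
# `matchup.batter.fullName`.
toy_pitches <- tibble::tibble(
  matchup.pitcher.fullName = c(
    "Pitcher A", "Pitcher A", "Pitcher A", "Pitcher A",
    "Pitcher B", "Pitcher B"
  ),
  isPitch = TRUE,
  type = "pitch",
  details.isBall = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE),
  # In this toy data, the fourth pitch is a ball in play and is flagged
  # as neither a ball nor a strike.
  details.isStrike = c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

pitch_counts <- toy_pitches %>%
  filter(isPitch, type == "pitch") %>%
  group_by(matchup.pitcher.fullName) %>%
  summarize(
    pitches = n(),
    balls = sum(details.isBall, na.rm = TRUE),
    strikes = sum(details.isStrike, na.rm = TRUE),
    ball_rate = balls / pitches,
    strike_rate = strikes / pitches,
    .groups = "drop"
  )

pitch_counts
```

To try the same pattern on the single game, swap `toy_pitches` for `dd`. Whether balls in play should count toward the strike rate is a design choice, so confirm how your data flag those pitches before comparing pitchers.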