How To: Creating Pitcher Splits with Play-by-Play Data, Part III
A brand new coding tutorial + the daily minor league report
Good morning! Today we continue our “How to” series, where we walk through the process of working with data and writing R code used to make some of the graphics in the newsletter. These posts will mostly be behind the paywall, but if you upgrade to a premium subscription you’ll have full access to all of our tutorials. We’ll do our best to walk through the code in as much detail as possible, but the comments section will serve as a great place to ask questions and provide feedback. Also, feel free to email us directly if you have questions.
If coding and working with data is of no interest to you, skip down to the sections covering last night’s minor league games.
Important note: The tutorial posts assume that the reader has at least a basic understanding of R. If you've never written a line of code or worked with baseball data, there are many excellent introductory resources. R for Data Science by Hadley Wickham & Garrett Grolemund is a great place to start. It covers everything you'll need to know to begin working with data, performing analysis, and building visualizations. The tutorial also assumes you worked through Part I and Part II and have a season's worth of data ready; if not, check them out first.
Two weeks ago, David Gerth had an excellent tutorial on creating cumulative slash-line plots that you could quickly adapt to evaluate players you’re interested in tracking. In my effort to build on each previous tutorial, this week I will show you how to take the code for his plots and apply it to the play-by-play pitching data we’ve been building on, so we can evaluate a few pitchers at a time.
As David pointed out in his post, these charts require just two libraries, plus a third if you want to display the charts together, as he did in his player spotlight post on Jasson Domínguez. Following his example, we will look at how the “slash lines” of different pitchers progress across a season.
Building off of last week’s "How To," we can look at pitcher performance across the year. Batters can expect fairly predictable outcomes when they swing hard and strike the ball at the right angle; pitchers face more variability. While they can and do control a lot within a game, they often depend on eight other players to perform and hope opposing hitters are unlucky. As such, there is some debate over what an equivalent pitcher slash line could or should be. Chapin Zerner discussed this earlier this year; see here. Without recreating that discussion, I will take Chapin's conclusions and show how you might look at the cumulative totals for a pitcher across a season to learn about a developing pitcher’s growth or success. In his article, he argued for a triple slash of WHIP, LOB%, and HR per 9. I will also include K% and other traditional pitching stats in the following analysis, using 2022 AA data.
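To make those three stats concrete, here is a minimal sketch of how cumulative WHIP, HR per 9, LOB%, and K% could be computed from per-game totals. The data and every column name below (outs, h, bb, hbp, hr, r, k, bf) are made up for illustration and are not the columns from Part I or Part II. The LOB% line uses the commonly cited FanGraphs formulation, which may differ from the definition in Chapin's article.

```r
library(dplyr)

# Made-up per-game totals for one hypothetical pitcher.
# All column names here are assumptions, not the tutorial's actual columns.
games <- tibble(
  game_date = as.Date(c("2022-04-08", "2022-04-14", "2022-04-20")),
  outs = c(15, 18, 12),  # outs recorded; innings pitched = outs / 3
  h    = c(4, 5, 6),     # hits allowed
  bb   = c(2, 1, 3),     # walks
  hbp  = c(0, 1, 0),     # hit batters
  hr   = c(1, 0, 2),     # home runs allowed
  r    = c(2, 1, 5),     # runs allowed
  k    = c(6, 7, 3),     # strikeouts
  bf   = c(21, 25, 21)   # batters faced
)

pitcher_line <- games %>%
  arrange(game_date) %>%
  mutate(
    cum_ip   = cumsum(outs) / 3,
    # WHIP = (walks + hits) / innings pitched
    cum_whip = (cumsum(bb) + cumsum(h)) / cum_ip,
    # HR per 9 = 9 * home runs / innings pitched
    cum_hr9  = 9 * cumsum(hr) / cum_ip,
    # LOB% (FanGraphs form) = (H + BB + HBP - R) / (H + BB + HBP - 1.4 * HR)
    cum_lob  = (cumsum(h) + cumsum(bb) + cumsum(hbp) - cumsum(r)) /
               (cumsum(h) + cumsum(bb) + cumsum(hbp) - 1.4 * cumsum(hr)),
    # K% = strikeouts / batters faced
    cum_kpct = cumsum(k) / cumsum(bf)
  )

pitcher_line %>% select(game_date, cum_whip, cum_hr9, cum_lob, cum_kpct)
```

Tracking outs rather than innings pitched avoids the usual trap of treating 5.2 innings as a decimal number.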
For this analysis, we are building off of Part I and Part II of working with pitcher play-by-play data, so the assumption is that you still have a large data frame holding an entire season’s worth of data. You might also have reduced it to a specific team or a specific player. For this exercise, go ahead and pick four different pitching prospects. I will select two stand-out strike throwers, one right-handed and one left-handed, and two pitchers who are not strike throwers, again one from each side.
Recall that I have the 2022 AA season. After sorting my data, I decided to compare Jefry Yan (LHP, Miami Marlins, AA), Luis Reyes (RHP, Washington Nationals, AA), Avery Weems (LHP, Texas Rangers, AA), and Brandon Pfaadt (RHP, Arizona Diamondbacks, AA). All four spent a significant amount of time at AA during 2022. Pfaadt and Weems had higher strike percentages than most, while Yan and Reyes had lower strike percentages. Now, let's see how their seasons looked.
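If you have not yet narrowed your season data down to these four pitchers, one way to do it might look like the sketch below. The season_data frame and its rows are made up for illustration; only the matchup.pitcher.fullName column name comes from the code in this post.

```r
library(dplyr)

# Hypothetical stand-in for the season-long data frame; the rows are invented.
season_data <- tibble(
  matchup.pitcher.fullName = c("Jefry Yan", "Luis Reyes", "Avery Weems",
                               "Brandon Pfaadt", "Some Other Pitcher"),
  game_date = c("2022-04-08", "2022-04-09", "2022-04-10",
                "2022-04-11", "2022-04-12")
)

# The four pitchers compared in this post
pitchers_of_interest <- c("Jefry Yan", "Luis Reyes",
                          "Avery Weems", "Brandon Pfaadt")

# Keep only rows belonging to those pitchers
pitcher_data <- season_data %>%
  filter(matchup.pitcher.fullName %in% pitchers_of_interest)

# Quick check of how many rows each pitcher contributes
pitcher_data %>% count(matchup.pitcher.fullName)
```

Filtering on names is convenient but can pull in a second player who shares a name, so if your data includes a pitcher ID column, filtering on that is the safer choice.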
library(dplyr)     # data wrangling
library(ggplot2)   # plotting
library(ggthemes)  # provides theme_tufte()

# Cumulative opposing slash line per pitcher and batter-handedness split
dd <- pitcher_data %>%
  mutate(game_date = as.Date(game_date)) %>%
  group_by(matchup.pitcher.fullName, matchup.splits.pitcher) %>%
  # Sort chronologically within each group so cumsum() runs in game order
  arrange(game_date, .by_group = TRUE) %>%
  mutate(
    cumavg  = cumsum(h) / cumsum(ab),
    cumobp  = cumsum(ob) / cumsum(pa),
    cumslug = cumsum(tb) / cumsum(ab)
  ) %>%
  select(matchup.pitcher.fullName, matchup.splits.pitcher, game_date,
         cumavg, cumobp, cumslug) %>%
  ungroup()

# One panel per pitcher/split combination, all on the same y-axis scale.
# linewidth needs ggplot2 3.4.0 or later; on older versions use size instead.
plot_1 <- dd %>%
  ggplot(aes(x = game_date)) +
  geom_line(aes(y = cumavg, color = "BA"), linewidth = 1, alpha = .5) +
  geom_line(aes(y = cumobp, color = "OBP"), linewidth = 1, alpha = .5) +
  geom_line(aes(y = cumslug, color = "SLUG"), linewidth = 1, alpha = .5) +
  scale_color_manual(values = c("BA" = "grey25", "OBP" = "lightpink", "SLUG" = "aquamarine4")) +
  labs(title = "Cumulative Pitcher Opposing Hitter Slash Line",
       x = "Month",
       y = "Percentage",
       color = "Opposing Slash Line") +
  facet_wrap(matchup.pitcher.fullName ~ matchup.splits.pitcher, scales = "fixed", ncol = 2) +
  theme_tufte()

plot_1
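A quick way to convince yourself the cumulative columns are right is to check that each group's final cumulative value matches its season total. Here is a small sketch on made-up data; the pitchers, dates, and counts are invented, and only the h, ab, and matchup.pitcher.fullName column names mirror the code above.

```r
library(dplyr)

# Toy data, deliberately out of date order to show why sorting matters
toy <- tibble(
  matchup.pitcher.fullName = rep(c("Pitcher A", "Pitcher B"), each = 3),
  game_date = as.Date(rep(c("2022-04-20", "2022-04-08", "2022-04-14"), times = 2)),
  h  = c(3, 1, 2, 0, 4, 2),
  ab = c(10, 8, 9, 7, 12, 10)
)

# Running average against, computed in chronological order per pitcher
cumulative <- toy %>%
  group_by(matchup.pitcher.fullName) %>%
  arrange(game_date, .by_group = TRUE) %>%
  mutate(cumavg = cumsum(h) / cumsum(ab)) %>%
  ungroup()

# Season-long average against, computed directly from totals
season <- toy %>%
  group_by(matchup.pitcher.fullName) %>%
  summarise(avg = sum(h) / sum(ab))

# Last cumulative row for each pitcher
last_values <- cumulative %>%
  group_by(matchup.pitcher.fullName) %>%
  slice_tail(n = 1) %>%
  ungroup()

# TRUE when the final cumulative values agree with the season totals
isTRUE(all.equal(last_values$cumavg, season$avg))
```

If the two disagree on your own data, the usual suspects are missing values in the counting columns or rows that were not sorted by date before cumsum() ran.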
One of the advantages of visualizing “small multiples” (i.e., many similar plots together) is the ability to make meaningful comparisons quickly. You can see how Avery Weems had a rough couple of starts early but consistently got better throughout the year. You can maybe see the opposite for Yan and Reyes. Brandon Pfaadt looked like he started struggling against left-handed batters in his last starts in August before getting a ticket to AAA.