How To: Creating Pitcher Splits with Play-by-Play Data, Part II

Part two of a series demonstrating how to create pitching splits with play-by-play data

Aug 25, 2023


If coding and working with data are of no interest to you, skip down to the sections covering last night’s minor league games.

Important note: The tutorial posts assume that the reader has at least a basic understanding of R. If you've never written a line of code or worked with baseball data, there are many excellent introductory resources. R for Data Science (https://r4ds.hadley.nz/) by Hadley Wickham & Garrett Grolemund is a great place to start. It covers everything you'll need to know to begin working with data, performing analysis, and building visualizations. The tutorial also assumes you worked through Part I and have a season's worth of data ready; if not, check it out first.


Picking up where we left off...

Last week, we used a complete season’s play-by-play (PBP) data for the 2022 Double-A season to look at pitchers’ splits for a single game. This was like taking a flame thrower to an ant hill. However, it was helpful for getting familiar with how the data are structured and for walking through one approach (there are many others you could take). Since R is a language, there are many ways to form a sentence to get what you want when exploring and analyzing the data for a specific need. I have no particular type of analysis, formal or informal, planned; instead, I’m going into this to show you how you might explore the data and look at exciting things as they surface.

These are the packages I used, or anticipated using, to achieve the results in this tutorial.

# packages needed
if (!requireNamespace('pacman', quietly = TRUE)){
  install.packages('pacman')
}

pacman::p_load(tidyverse, here, janitor, baseballr, Lahman, rvest, retrosheet, lubridate, gt, ggthemes, webshot, gridExtra, patchwork, ggplotify)

Rather than re-downloading an entire season (which, recall, takes about an hour), you can save your data as an .rds file so it is available locally on your machine. If your season object is named df, you can save it to a directory of your choice and read it back with the following lines. (Note: This tutorial assumes your project has a sub-directory named `data` beneath the directory that holds your `.Rproj` file, but I’m not here to advise on project flow.)

# save your data as a rds object
saveRDS(df, here("data", "aa_2022.rds")) 

# load your data from an rds object
df <- readRDS(here("data", "aa_2022.rds"))
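If you want the save-or-load decision to happen automatically, one option is a small wrapper that only runs the slow step when the file is missing. This is a minimal sketch, not something from Part I: `load_or_build()` and `toy_builder()` are hypothetical names, and the toy builder stands in for whatever download code you actually used.

```r
# Hypothetical helper: read a cached .rds file if it exists,
# otherwise build the object, save it, and return it.
load_or_build <- function(path, build_fn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  obj <- build_fn()
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(obj, path)
  obj
}

# Toy stand-in for the slow season download (an assumption for illustration)
build_calls <- 0
toy_builder <- function() {
  build_calls <<- build_calls + 1
  data.frame(game_pk = c(1, 1, 2), isPitch = c(TRUE, TRUE, FALSE))
}

cache_path <- file.path(tempdir(), "data", "toy_season.rds")
unlink(cache_path)  # start clean so the first call has to build

first  <- load_or_build(cache_path, toy_builder)  # builds and saves
second <- load_or_build(cache_path, toy_builder)  # reads from disk
```

In this tutorial's setup, you would pass `here("data", "aa_2022.rds")` as the path and your own download function as the builder.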

Take another look at the full df, and for this tutorial, filter it down to your favorite team (or a random one) and save the subset as its own data frame.

# take a look at df
glimpse(df)

# let's pick a team (here, Richmond) and keep every game it played, home or away
richmond_df <- df %>% 
  filter(home_team == "Richmond Flying Squirrels" | away_team == "Richmond Flying Squirrels")
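If you would rather let R pick the team, here is one possible sketch. It assumes only the `home_team` and `away_team` columns used in the filter above; `toy_df`, "Team A", and "Team B" are made-up placeholders standing in for the real season data.

```r
library(dplyr)

# Toy stand-in for the season data frame; only the two team columns matter here
toy_df <- tibble(
  home_team = c("Richmond Flying Squirrels", "Team A", "Team B", "Team A"),
  away_team = c("Team A", "Richmond Flying Squirrels", "Team A", "Team B")
)

# Every team that appears as either the home or the away side
teams <- sort(unique(c(toy_df$home_team, toy_df$away_team)))

# Pick one at random (call set.seed() first if you want a repeatable pick)
random_team <- sample(teams, 1)

# Keep every game that team played, home or away
team_df <- toy_df %>%
  filter(home_team == random_team | away_team == random_team)
```

Printing `teams` before sampling is also a quick way to confirm the exact spelling of a team name before you hard-code it into a filter.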

Looking at a Complete Season of Pitches at the Plate

Looking at raw pitch totals (e.g., total pitches, balls, or strikes thrown) isn’t really all that interesting for advanced analysis. The count when the pitch is thrown, along with its location, velocity, movement, etc., explains a pitcher’s skill far better than the simple “Do they throw more strikes than balls?” But these foundational steps will help build some of those more advanced metrics you pour an extra cup of coffee to read about weekly on this site. Let’s quickly revisit some of the ways we looked at the single-game data, but this time, let’s look at how those totals landed for the Richmond Flying Squirrels (the San Francisco Giants’ Double-A affiliate) over an entire season.

richmond_df %>% 
  # keep only pitches thrown by Richmond pitchers
  filter(isPitch == TRUE & fielding_team == "Richmond Flying Squirrels") %>%
  # one row per pitcher per batter-handedness split
  group_by(fielding_team, matchup.pitcher.fullName, matchup.splits.pitcher) %>%
  summarize(
    ball = sum(details.isBall == TRUE),
    strike = sum(details.isStrike == TRUE),
    # anything that is neither a ball nor a strike is treated as contact
    contact = sum(details.isBall == FALSE & details.isStrike == FALSE),
    pitches = sum(pitchNumber >= 1),
    strike_perc = round(strike / pitches, 3) * 100,
    ball_perc = round(ball / pitches, 3) * 100,
    contact_perc = round(contact / pitches, 3) * 100,
    perc_difference = strike_perc - ball_perc,
    # season totals divided by nine
    pp9 = round(pitches / 9, 1),
    sp9 = round(strike / 9, 1),
    bp9 = round(ball / 9, 1),
    .groups = "drop"
  ) %>%
  # regroup by split so the means below are team-level averages
  group_by(fielding_team, matchup.splits.pitcher) %>% 
  # creates team means and centered values for individual pitchers
  mutate(
    across(ball:contact_perc, ~ round(mean(.), 2), .names = "{.col}_mean"),
    across(ball:contact_perc, ~ round(. - mean(.), 2), .names = "{.col}_diff"),
    across(perc_difference, ~ round(mean(.), 3), .names = "{.col}_mean"),
    across(perc_difference, ~ round(. - mean(.), 3), .names = "{.col}_diff")
  )
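To see what the `_mean` and `_diff` columns hold, it can help to run the same pattern on a handful of made-up rows where you can check the answers by hand. This is a scaled-down sketch: `pitcher`, `bat_side`, `is_ball`, and `is_strike` are simplified stand-ins for the real column names, and the two pitchers are invented.

```r
library(dplyr)

# Made-up pitch rows: two pitchers, ten pitches each, all to right-handed batters
toy_pitches <- tibble(
  pitcher   = c(rep("Pitcher A", 10), rep("Pitcher B", 10)),
  bat_side  = "R",
  is_ball   = c(rep(TRUE, 3), rep(FALSE, 7),          # A: 3 balls
                rep(TRUE, 5), rep(FALSE, 5)),         # B: 5 balls
  is_strike = c(rep(FALSE, 3), rep(TRUE, 6), FALSE,   # A: 6 strikes, 1 in play
                rep(FALSE, 5), rep(TRUE, 4), FALSE)   # B: 4 strikes, 1 in play
)

toy_summary <- toy_pitches %>%
  group_by(pitcher, bat_side) %>%
  summarize(
    pitches     = n(),
    strike_perc = round(sum(is_strike) / pitches, 3) * 100,
    ball_perc   = round(sum(is_ball) / pitches, 3) * 100,
    .groups = "drop"
  ) %>%
  # regroup by batter side only, so mean() runs across pitchers
  group_by(bat_side) %>%
  mutate(
    across(strike_perc:ball_perc, ~ round(mean(.x), 2), .names = "{.col}_mean"),
    across(strike_perc:ball_perc, ~ round(.x - mean(.x), 2), .names = "{.col}_diff")
  ) %>%
  ungroup()
```

Pitcher A throws 60% strikes against a two-pitcher mean of 50%, so his `strike_perc_diff` is 10 and Pitcher B's is -10. The full Richmond pipeline does the same thing with more columns and more pitchers.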

And here is how we might take these data and make a table to argue about the future outlook of the guys on the bump with our friends who happen to be Dodger fans. But, for everyone else not interested in that particular rivalry, you could explore different teams and see if you can understand their organizational strategy for pitching. Do they want to flood the zone with strikes? Do some pitchers excel in those situations?

(Note: This tutorial is about baseball data exploration and analysis, so I'll leave the code for the displayed tables and figures for another time, or leave it to you to build as you please.)

We might quickly look at this and immediately say, “Well, some of these fellas are starters, and some are relievers, yes?” Yup. You’re right, and you could break that down if you wanted. Here, you could sort of guess (unless you knew explicitly). Since he just made his first big league start, let’s focus on Kyle Harrison’s line against RHB.
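If you did want to separate starters from relievers, one rough way is to flag whoever threw his team's first pitch in each game. The sketch below is an illustration only: `game_id`, `at_bat_index`, and `pitcher` are hypothetical, simplified column names, the rows are invented, and it assumes the data are already filtered to a single fielding team. The 50% cutoff for calling someone a starter is an arbitrary choice.

```r
library(dplyr)

# Invented rows for one fielding team across two games
toy_pbp <- tibble(
  game_id      = c(1, 1, 1, 1, 2, 2, 2),
  at_bat_index = c(0, 1, 20, 21, 0, 1, 15),
  pitcher      = c("Pitcher A", "Pitcher A", "Pitcher B", "Pitcher B",
                   "Pitcher A", "Pitcher A", "Pitcher C")
)

roles <- toy_pbp %>%
  group_by(game_id) %>%
  # the pitcher on the mound for the earliest at-bat started that game
  mutate(started_game = pitcher == pitcher[which.min(at_bat_index)]) %>%
  group_by(pitcher) %>%
  summarize(
    games  = n_distinct(game_id),
    starts = n_distinct(game_id[started_game]),
    .groups = "drop"
  ) %>%
  # arbitrary rule: a starter began at least half of his appearances
  mutate(role = if_else(starts / games >= 0.5, "starter", "reliever"))
```

Joining `roles` back onto the pitcher summary by name would let you compare starters only with starters, and relievers only with relievers.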

Of his pitches to RHB, 50.8% were strikes, 37.9% were balls, and 11.3% resulted in some form of contact. Relative to the team average, his strike rate to righties was 2.89 percentage points higher and his ball rate 1.28 points higher, while his contact rate was 4.17 points lower. His strikes_to_balls figure of 1.614 (his strike-minus-ball gap relative to the team's) suggests he was a more reliable strike thrower than the rest of the team. This matches what your eyes could see in his last start against the Phillies, when he pounded the zone with his fastball, and it fits the Giants organization's strategy of flooding the zone with effective pitches. This table is neat but not earth-shattering or perspective-altering.
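The numbers in Harrison's line can be checked against each other with a little arithmetic. The sketch below uses only the values quoted in the paragraph above; the variable names are mine.

```r
# Harrison vs. RHB, values quoted in the paragraph above
strike_perc  <- 50.8
ball_perc    <- 37.9
contact_perc <- 11.3

# Differences from the team mean (contact is below the mean, so negative)
strike_diff  <- 2.89
ball_diff    <- 1.28
contact_diff <- -4.17

# The three shares cover every pitch, so they should sum to 100
total_share <- strike_perc + ball_perc + contact_perc

# Each pitcher's shares sum to 100, so the three differences
# from the team mean should cancel out (up to rounding)
diff_sum <- strike_diff + ball_diff + contact_diff

# (strike - ball) centered on its mean equals strike_diff - ball_diff
gap_vs_team <- strike_diff - ball_diff
```

`gap_vs_team` comes out to 1.61, which lines up with the 1.614 in the table; the small gap is what you would expect when the centered percentages are rounded to two decimals and the difference column to three.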

We could also visualize these data to answer a different question: do strike throwers invite more contact of all kinds? You could break this down by type of contact, exit velocity, etc.

You could look at any pitcher and see if, on average, they throw more strikes than the rest of their team and if that also leads to more contact. (Note: it doesn’t, on average, as these are virtually uncorrelated, likely because the count is more meaningful, as are pitch type, location, velocity, how scary the batter is, runners on, etc.)
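One quick way to put a number on that relationship is a correlation between strike rate and contact rate across pitchers, computed within each split. The sketch below assumes you stored the pitcher summary in an object; `pitcher_summary` is a hypothetical name, the split labels are assumed, and the values are invented, so the resulting correlations say nothing about real pitchers.

```r
library(dplyr)

# Invented values standing in for the per-pitcher summary built earlier
pitcher_summary <- tibble(
  matchup.pitcher.fullName = paste("Pitcher", LETTERS[1:6]),
  matchup.splits.pitcher   = rep(c("vs_RHB", "vs_LHB"), each = 3),  # assumed labels
  strike_perc  = c(48.0, 51.5, 46.2, 47.3, 50.1, 49.0),
  contact_perc = c(15.1, 14.0, 16.3, 15.5, 14.9, 13.8)
)

# Pearson correlation between strike rate and contact rate, per split
strike_contact_cor <- pitcher_summary %>%
  group_by(matchup.splits.pitcher) %>%
  summarize(
    n_pitchers = n(),
    r = cor(strike_perc, contact_perc),
    .groups = "drop"
  )
```

With real data, a value of `r` near zero in each split would be consistent with the note above, though with only a team's worth of pitchers the estimate will be noisy.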

Looking at Opposing Batter Slash Lines by Pitcher Splits
