How To: Getting Minor League Play-by-Play Data, Part II
The next step for obtaining minor league play-by-play data using R
Good morning! In last Friday’s post we started walking through the process of obtaining minor league play-by-play data — you can find that here.
FYI: The tutorial posts are for premium subscribers only, but we’ll occasionally post them for everyone like we did a few months ago. You can find that post here.
Important note: The tutorial posts are going to assume that the reader has at least a basic understanding of R. If you’ve never written a line of code or worked with baseball data, there are many great introductory resources out there. A great place to start is R for Data Science by Hadley Wickham & Garrett Grolemund. It covers everything you’ll need to know to begin working with data, performing analysis, and building visualizations.
Picking up where we left off…
Last week we walked through how to pull a list of games and write a function that takes that game list as an argument. We used that function to pull play-by-play data for one day’s worth of games last season. This week, we’ll take it one step further and build a function to grab play-by-play data for an entire level (e.g., AAA, Low-A) of games for an entire season.
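As a rough picture of where this is heading, here is a minimal sketch of a function that loops over a vector of game IDs and stacks the results. This is not the function from the post: get_season_pbp and fake_fetch are hypothetical names, and the sketch assumes baseballr's mlb_pbp (which takes a single game_pk) is the fetcher you want. The fetcher is passed in as an argument so the loop can be tried offline with a stand-in.

```r
library(dplyr)
library(purrr)

# Hypothetical helper (not from the original post): fetch play-by-play for
# every game_pk in `game_pks` and stack the results into one tibble.
# `fetch` defaults to baseballr::mlb_pbp; it is a parameter so the loop can
# be exercised without hitting the API.
get_season_pbp <- function(game_pks, fetch = baseballr::mlb_pbp) {
  # possibly() returns NULL instead of stopping when a single game fails
  safe_fetch <- possibly(fetch, otherwise = NULL)

  game_pks %>%
    map(safe_fetch) %>%
    compact() %>%   # drop the games that came back NULL
    bind_rows()
}

# Offline check with a made-up fetcher; game 2 fails on purpose
fake_fetch <- function(game_pk) {
  if (game_pk == 2) stop("no data for this game")
  tibble(game_pk = game_pk, event = c("Strikeout", "Single"))
}

pbp_demo <- get_season_pbp(c(1, 2, 3), fetch = fake_fetch)
```

With real data you would pass the game_pk column of your schedule tibble and leave fetch at its default. A full season is thousands of requests, so expect it to take a while, and be aware that columns can differ between games, which may need handling before the rows are bound together.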
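library(tidyverse)
library(baseballr)
library(lubridate)
library(gt)

# Pull the 2022 schedule for sport_id 12 (Double-A), shorten a few column
# names, and keep only regular-season games (game_type "R")
games = mlb_schedule(2022, 12) %>%
  rename(
    home_team = teams_home_team_name,
    away_team = teams_away_team_name,
    status = status_abstract_game_state
  ) %>%
  filter(game_type == 'R')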
This time we’ll use the mlb_schedule function from the baseballr package to retrieve the schedule for all Double-A games in 2022. Once we run it and filter on game_type, we should have a tibble with every Double-A regular season game from last season. We may also want data for a different level or league, or at least to check whether it’s available. We can pull the codes for every league and level by running the following:
leagues = mlb_league(2022)
leagues %>%
  select(league_id, sport_id, league_name, league_num_games)
You should now have a tibble listing each league with its league_id, sport_id, league_name, and league_num_games. The value we are going to use is actually not the league_id, but the sport_id. In that output, 12 corresponds to leagues at the Double-A level. You can try any level (i.e., any sport_id) you want.
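If you expect to switch levels often, the schedule call can be wrapped in a small helper. The sketch below is an assumption about how you might organize this, not code from the post: get_level_games and fake_schedule are hypothetical names, and the helper passes its arguments positionally to mlb_schedule the same way the call above does. The stand-in fetcher returns made-up rows so the example runs offline.

```r
library(dplyr)

# Hypothetical helper: regular-season schedule for one level in one season.
# `fetch` defaults to baseballr::mlb_schedule and is swappable for testing.
get_level_games <- function(season, sport_id, fetch = baseballr::mlb_schedule) {
  fetch(season, sport_id) %>%
    filter(game_type == "R")   # keep regular-season games only
}

# Made-up stand-in for the API call: two regular-season rows and one other
fake_schedule <- function(season, level_ids) {
  tibble(
    game_pk   = c(101, 102, 103),
    season    = season,
    game_type = c("R", "R", "S")
  )
}

level_games_demo <- get_level_games(2022, 12, fetch = fake_schedule)
```

Against the real API you would drop the fetch argument and swap 12 for whichever sport_id you found in the leagues tibble.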
Similar to last week, we’ll clean up our data a bit before running it through our function to pull the play-by-play data. The main reason is that we may use the game list for something later, and it contains a bunch of columns we may not want.
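One way that clean-up might look is sketched below. Which columns to keep, and whether to drop games that never reached a "Final" status, are assumptions made for illustration rather than the post's actual choices. games_demo is a made-up stand-in for the renamed schedule tibble, and extra_col represents the many columns you may not need.

```r
library(dplyr)

# Made-up stand-in for the renamed schedule tibble
games_demo <- tibble(
  game_pk   = c(101, 102, 103),
  home_team = c("Team A", "Team B", "Team C"),
  away_team = c("Team D", "Team E", "Team F"),
  status    = c("Final", "Final", "Preview"),
  extra_col = c("x", "y", "z")
)

# Keep completed games (an assumption) and only the columns we plan to reuse
game_list <- games_demo %>%
  filter(status == "Final") %>%
  select(game_pk, home_team, away_team, status)
```

The game_pk column is what a play-by-play function would consume, so it is the one column that has to survive whatever trimming you do.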