Down on the Farm

How To: Getting Minor League Play-by-Play Data, Part II

The next step for obtaining minor league play-by-play data using R

Josh Wittmer
Feb 10, 2023

Good morning! In last Friday’s post we started walking through the process of obtaining minor league play-by-play data — you can find that here.

FYI: The tutorial posts are for premium subscribers only, but we’ll occasionally post them for everyone like we did a few months ago. You can find that post here.

Important note: The tutorial posts are going to assume that the reader has at least a basic understanding of R. If you’ve never written a line of code or worked with baseball data, there are many great introductory resources out there. A great place to start is R for Data Science by Hadley Wickham & Garrett Grolemund. It covers everything you’ll need to know to begin working with data, performing analysis, and building visualizations.

Picking up where we left off…

Last week we walked through how to pull a list of games and write a function that takes that game list as an argument. We used that function to pull play-by-play data for one day’s worth of games last season. This week, we’ll take it one step further and build a function to grab play-by-play data for an entire level (e.g., AAA, Low-A) of games for an entire season.

library(tidyverse)
library(baseballr)
library(lubridate)
library(gt)

# Pull the 2022 schedule for sport_id 12 (Double-A)
games = mlb_schedule(2022, 12) %>%
    rename(
        home_team = teams_home_team_name,
        away_team = teams_away_team_name,
        status = status_abstract_game_state
    ) %>%
    # Keep regular season games only
    filter(game_type == 'R')

This time we’ll use the mlb_schedule function from the baseballr package to retrieve the schedule for all Double-A games in 2022. Once we run this function and do some filtering, we should have a tibble with every Double-A regular season game from last season. We may also want to get data (or see if it’s available) for a different level or league — we can pull all the codes for leagues/levels by running the following…

leagues = mlb_league(2022)

leagues %>% 
    select(league_id, sport_id, league_name, league_num_games)

Running this should return a tibble with one row per league. The value we are going to use is actually not the league_id, but the sport_id: 12 corresponds to leagues that are at the Double-A level. You can try any level (i.e., any sport_id) you want.
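Because the level is just the second argument to mlb_schedule, the schedule call from earlier can be wrapped in a small helper so that switching levels only means changing one number. This is a minimal sketch rather than the function built in this series: the name get_level_games and its arguments are made up for illustration, and it assumes mlb_schedule returns the same columns that were renamed above.

```r
library(dplyr)
library(baseballr)

# Hypothetical helper (name and arguments are illustrative, not from the post):
# pull the regular season schedule for one season and one level (sport_id).
get_level_games <- function(season, sport_id) {
  mlb_schedule(season, sport_id) %>%
    rename(
      home_team = teams_home_team_name,
      away_team = teams_away_team_name,
      status = status_abstract_game_state
    ) %>%
    filter(game_type == "R")  # regular season games only
}

# Example call (needs an internet connection):
# games_aa <- get_level_games(2022, 12)  # 12 = Double-A
```

Passing a different sport_id taken from the leagues tibble should return that level's schedule instead, provided the data is available for it.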

Similar to last week, we’ll clean up our data a bit before running it through our function to pull the play-by-play data. The main reason is that we may reuse the game list later, and it contains a bunch of columns we may not want.
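One way such a clean-up step could look is sketched below. This is an assumption and not necessarily the clean-up used in this series: the function name clean_game_list is hypothetical, the columns kept (game_pk, game_date, plus the renamed team and status columns) are a guess at what is useful later, and the "Final" status value is assumed to mark completed games.

```r
library(dplyr)

# Hypothetical clean-up step (illustrative only): keep completed games and a
# handful of identifying columns, then drop exact duplicate rows.
clean_game_list <- function(games) {
  games %>%
    filter(status == "Final") %>%  # assumes "Final" marks completed games
    select(any_of(c("game_pk", "game_date", "home_team", "away_team", "status"))) %>%
    distinct()
}

# Example call, assuming `games` is the tibble created earlier:
# game_list <- clean_game_list(games)
```

Using any_of() means a column missing from the schedule is skipped instead of raising an error, which keeps the sketch usable if the column names differ slightly across package versions.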
